Text Mining - moRe than woRds

Sanjiv Ranjan Das and Karthik Mokashi

UseR @Stanford – June 2016

Reference monograph

Text expands the universe of data many-fold. See my monograph on text mining in finance at: http://srdas.github.io/Das_TextAnalyticsInFinance.pdf

The monograph covers some of the content of this presentation. The files at the link below accompany the talk, and you may run the program code as we proceed.

http://srdas.github.io/Temp/user2016/

Text as Data

  1. Big Text: there is more textual data than numerical data.
  2. Text is versatile: it conveys nuances and behavioral expressions that numbers do not.
  3. Text contains emotive content. Sentiment analysis. Admati-Pfleiderer 2001; DeMarzo et al 2003; Antweiler-Frank 2004, 2005; Das-Chen 2007; Tetlock 2007; Tetlock et al 2008; Mitra et al 2008; Leinweber-Sisk 2010.
  4. Text contains opinions and connections. Das et al 2005; Das and Sisk 2005; Godes et al 2005; Li 2006; Hochberg et al 2007.
  5. Numbers aggregate; text disaggregates.

Anecdotal …

  1. In a talk at the 17th ACM Conference on Information and Knowledge Management (CIKM ’08), Google’s director of research Peter Norvig stated his unequivocal preference for data over algorithms: “data is more agile than code.” Yet it is well understood that too much data can lead to overfitting, so that an algorithm becomes mostly useless out-of-sample.
  2. Chris Anderson: “Data is the New Theory.”
  3. These issues are relevant to text mining, but let’s put them on hold till the end of the session.

Definition: Text-Mining

  1. Text mining is the large-scale, automated processing of plain text language in digital form to extract data that is converted into useful quantitative or qualitative information.
  2. Text mining is automated on big data that is not amenable to human processing within reasonable time frames. It entails extracting data that is converted into information of many types.
  3. Simple: Text mining may be as simple as keyword searches and counts.
  4. Complicated: It may require language parsing and complex rules for information extraction.
  5. Structured text, such as the information in forms and some kinds of web pages, is the easier case.
  6. Unstructured text is much harder to mine.
  7. Text mining is also aimed at unearthing unseen relationships in unstructured text, as in meta-analyses of research papers; see Van Noorden 2012.
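
The "simple" end of this spectrum, keyword searches and counts, can be sketched in a few lines of base R. The three documents and the keywords below are made up for illustration; they are not from the talk's data.

```r
# Toy documents (hypothetical, for illustration only)
docs <- c("Markets rallied as earnings beat forecasts",
          "Earnings missed; markets fell sharply",
          "No news on earnings today")

# Lower-case the text and split on anything that is not a letter
words <- strsplit(tolower(docs), "[^a-z]+")

# Count how often a keyword appears in one tokenized document
count_keyword <- function(w, keyword) sum(w == keyword)

earnings_counts <- sapply(words, count_keyword, keyword = "earnings")
markets_counts  <- sapply(words, count_keyword, keyword = "markets")

print(earnings_counts)
print(markets_counts)
```

Each result has one count per document, so the vectors can be used directly as simple numeric features.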

Definition: News Analytics

Wikipedia defines it as: “… the measurement of the various qualitative and quantitative attributes of textual (unstructured data) news stories. Some of these attributes are: sentiment, relevance, and novelty. Expressing news stories as numbers permits the manipulation of everyday information in a mathematical and statistical way. News analytics are used in financial modeling, particularly in quantitative and algorithmic trading. Further, news analytics can be used to plot and characterize firm behaviors over time and thus yield important strategic insights about rival firms. News analytics are usually derived through automated text analysis and applied to digital texts using elements from natural language processing and machine learning such as latent semantic analysis, support vector machines, ‘bag of words’, among other techniques.”

https://www.amazon.com/Handbook-News-Analytics-Finance/dp/047066679X/ref=sr_1_1?ie=UTF8&qid=1466897817&sr=8-1&keywords=handbook+of+news+analytics

Data and Algorithms

Text Extraction

The R programming language is increasingly being used to download text from the web and then analyze it. The ease with which R can scrape text from a web site may be seen from the following simple command:

text = readLines("http://srdas.github.io/bio-candid.html")
text[15:20]
## [1] "being an academic, he worked in the derivatives business in the"      
## [2] "Asia-Pacific region as a Vice-President at Citibank. His current"     
## [3] "research interests include: the modeling of default risk, machine"    
## [4] "learning, social networks, derivatives pricing models, portfolio"     
## [5] "theory, and venture capital. He has published over eighty articles in"
## [6] "academic journals, and has won numerous awards for research and"

Here, we downloaded my bio page from my university’s web site. It’s a simple HTML file.

length(text)
## [1] 79
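
The same readLines() call works on a local file, which is handy for trying things out without a network connection. The sketch below writes a tiny, made-up HTML page to a temporary file and reads it back; the page content is hypothetical.

```r
# Write a small HTML page (hypothetical content) to a temporary file
tmp <- tempfile(fileext = ".html")
writeLines(c("<HTML>", "<BODY>",
             "First line of text",
             "Second line of text",
             "</BODY>", "</HTML>"), tmp)

# readLines() returns a character vector with one element per line
page_lines <- readLines(tmp)
n_lines <- length(page_lines)
print(n_lines)
print(page_lines[3:4])
```

As with the downloaded page, the result is an ordinary character vector, so indexing and length() behave the same way.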

String Parsing

Suppose we just want the 17th line. We extract it by its index:

text[17]
## [1] "research interests include: the modeling of default risk, machine"

And, to find out the character length of this line, we use the function str_length():

library(stringr)
str_length(text[17])
## [1] 65

We first loaded the stringr library, which contains many string-handling functions. Because str_length() is vectorized, we may also get the length of each line by applying it to the entire text vector.

text_len = str_length(text)
print(text_len)
##  [1]  6 69  0 66 70 70 70 63 69 65 68 67 64 67 63 64 65 64 69 63 68 70 39
## [24]  0  0 56  0 65 67 66 65 64 66 69 63 69 65 27  0  3  0 71 71 69 68 71
## [47] 12  0  3  0 71 70 68 71 69 63 67 69 64 67  7  0  3  0 67 71 65 63 72
## [70] 69 68 66 69 70 70 43  0  0  0
print(text_len[55])
## [1] 69
text_len[17]
## [1] 65
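
For plain strings like these, base R's nchar() gives the same kind of per-line lengths without loading a package. The sketch below uses a made-up four-line vector and an arbitrary threshold of 10 characters to flag the longer lines; both are assumptions for illustration.

```r
# Toy lines (hypothetical): markup, an empty line, body text, markup
toy_lines <- c("<HTML>", "", "Sanjiv Das is a Professor of Finance", "<p>")

# nchar() is vectorized: one length per element
line_len <- nchar(toy_lines)
print(line_len)

# Indices of lines longer than an arbitrary threshold of 10 characters
long_idx <- which(line_len > 10)
print(long_idx)
```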

Sort by Length

Some lines are very long; these are the ones we are mainly interested in, as they contain the bulk of the story, whereas many of the shorter lines contain HTML formatting instructions. Thus, we sort all the lines by length in decreasing order with the following set of commands, so that the longest lines appear first.

res = sort(text_len,decreasing=TRUE,index.return=TRUE)
idx = res$ix
text2 = text[idx]
text2
##  [1] "important to open the academic door to the ivory tower and let the world"
##  [2] "Sanjiv is now a Professor of Finance at Santa Clara University. He came" 
##  [3] "to SCU from Harvard Business School and spent a year at UC Berkeley. In" 
##  [4] "previous lives into his current existence, which is incredibly confused" 
##  [5] "Sanjiv's research style is instilled with a distinct \"New York state of"
##  [6] "funds, the internet, portfolio choice, banking models, credit risk, and" 
##  [7] "ocean.  The many walks in Greenwich village convinced him that there is" 
##  [8] "Santa Clara University's Leavey School of Business. He previously held"  
##  [9] "faculty appointments as Associate Professor at Harvard Business School"  
## [10] "and UC Berkeley. He holds post-graduate degrees in Finance (M.Phil and"  
## [11] "published in May 2010.  He currently also serves as a Senior Fellow at"  
## [12] "mind\" - it is chaotic, diverse, with minimal method to the madness. He" 
## [13] "any time you like, but you can never leave.\" Which is why he is doomed" 
## [14] "to a lifetime in Hotel California. And he believes that, if this is as"  
## [15] "<BODY background=\"http://algo.scu.edu/~sanjivdas/graphics/back2.gif\">" 
## [16] "Berkeley), an MBA from the Indian Institute of Management, Ahmedabad,"   
## [17] "theory, and venture capital. He has published over eighty articles in"   
## [18] "science fiction movies, and writing cool software code. When there is"   
## [19] "academic papers, which helps him relax. Always the contrarian, Sanjiv"   
## [20] "his past life in the unreal world, Sanjiv worked at Citibank, N.A. in"   
## [21] "has unpublished articles in many other areas. Some years ago, he took"   
## [22] "There he learnt about the fascinating field of Randomized Algorithms,"   
## [23] "in. Academia is a real challenge, given that he has to reconcile many"   
## [24] "explains, you never really finish your education - \"you can check out"  
## [25] "College), and is also a qualified Cost and Works Accountant. He is a"    
## [26] "teaching. His recent book \"Derivatives: Principles and Practice\" was"  
## [27] "the Asia-Pacific region. He takes great pleasure in merging his many"    
## [28] "has published articles on derivatives, term-structure models, mutual"    
## [29] "more opinions than ideas. He has been known to have turned down many"    
## [30] "senior editor of The Journal of Investment Management, co-editor of"     
## [31] "Research, and Associate Editor of other academic journals. Prior to"     
## [32] "growing up, Sanjiv moved to New York to change the world, hopefully"     
## [33] "confirming that an unchecked hobby can quickly become an obsession."     
## [34] "pursuits, many of which stem from being in the epicenter of Silicon"     
## [35] "Coastal living did a lot to mold Sanjiv, who needs to live near the"     
## [36] "Sanjiv Das is the William and Janice Terry Professor of Finance at"      
## [37] "through research.  He graduated in 1994 with a Ph.D. from NYU, and"      
## [38] "mountains meet the sea, riding sport motorbikes, reading, gadgets,"      
## [39] "offers from Mad magazine to publish his academic work. As he often"      
## [40] "B.Com in Accounting and Economics (University of Bombay, Sydenham"       
## [41] "research interests include: the modeling of default risk, machine"       
## [42] "After loafing and working in many parts of Asia, but never really"       
## [43] "since then spent five years in Boston, and now lives in San Jose,"       
## [44] "thinks that New York City is the most calming place in the world,"       
## [45] "no such thing as a representative investor, yet added many unique"       
## [46] "The Journal of Derivatives and The Journal of Financial Services"        
## [47] "Asia-Pacific region as a Vice-President at Citibank. His current"        
## [48] "learning, social networks, derivatives pricing models, portfolio"        
## [49] "California.  Sanjiv loves animals, places in the world where the"        
## [50] "skills he now applies earnestly to his editorial work, and other"        
## [51] "Ph.D. from New York University), Computer Science (M.S. from UC"         
## [52] "being an academic, he worked in the derivatives business in the"         
## [53] "academic journals, and has won numerous awards for research and"         
## [54] "time available from the excitement of daily life, Sanjiv writes"         
## [55] "time off to get another degree in computer science at Berkeley,"         
## [56] "features to his personal utility function. He learnt that it is"         
## [57] "<p> <B>Sanjiv Das: A Short Academic Life History</B> <p>"                
## [58] "bad as it gets, life is really pretty good."                             
## [59] "the FDIC Center for Financial Research."                                 
## [60] "after California of course."                                             
## [61] "and diverse."                                                            
## [62] "Valley."                                                                 
## [63] "<HTML>"                                                                  
## [64] "<p>"                                                                     
## [65] "<p>"                                                                     
## [66] "<p>"                                                                     
## [67] ""                                                                        
## [68] ""                                                                        
## [69] ""                                                                        
## [70] ""                                                                        
## [71] ""                                                                        
## [72] ""                                                                        
## [73] ""                                                                        
## [74] ""                                                                        
## [75] ""                                                                        
## [76] ""                                                                        
## [77] ""                                                                        
## [78] ""                                                                        
## [79] ""
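
If only the few longest lines are wanted rather than the full sorted vector, the sorted index can be truncated. This is a minimal sketch using base R order() on a made-up vector; the choice of three lines is arbitrary.

```r
# Toy lines (hypothetical) of varying length
toy_lines <- c("<p>",
               "A fairly long line of body text",
               "",
               "Short",
               "The longest line of body text in this toy example")

# order() returns the indices that sort the lengths, longest first
ord <- order(nchar(toy_lines), decreasing = TRUE)

# Keep only the three longest lines
top3 <- toy_lines[ord[1:3]]
print(top3)
```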

Text cleanup

In short, text extraction can be exceedingly simple, though getting clean text takes more work. A reasonable first pass is to strip the characters used in HTML markup and punctuation. We undertake the following steps, which use regular expressions (through str_replace_all() from stringr) to eliminate HTML formatting characters.

We first collapse the lines into one string and then replace the formatting characters with spaces. This generates one single paragraph of text, relatively clean of formatting characters, although tag names such as HTML, BODY and p remain. Treating such text as an unordered collection of words is known as the “bag of words” approach.

text = paste(text,collapse="\n")
print(text)
## [1] "<HTML>\n<BODY background=\"http://algo.scu.edu/~sanjivdas/graphics/back2.gif\">\n\nSanjiv Das is the William and Janice Terry Professor of Finance at\nSanta Clara University's Leavey School of Business. He previously held\nfaculty appointments as Associate Professor at Harvard Business School\nand UC Berkeley. He holds post-graduate degrees in Finance (M.Phil and\nPh.D. from New York University), Computer Science (M.S. from UC\nBerkeley), an MBA from the Indian Institute of Management, Ahmedabad,\nB.Com in Accounting and Economics (University of Bombay, Sydenham\nCollege), and is also a qualified Cost and Works Accountant. He is a\nsenior editor of The Journal of Investment Management, co-editor of\nThe Journal of Derivatives and The Journal of Financial Services\nResearch, and Associate Editor of other academic journals. Prior to\nbeing an academic, he worked in the derivatives business in the\nAsia-Pacific region as a Vice-President at Citibank. His current\nresearch interests include: the modeling of default risk, machine\nlearning, social networks, derivatives pricing models, portfolio\ntheory, and venture capital. He has published over eighty articles in\nacademic journals, and has won numerous awards for research and\nteaching. His recent book \"Derivatives: Principles and Practice\" was\npublished in May 2010.  He currently also serves as a Senior Fellow at\nthe FDIC Center for Financial Research.\n\n\n<p> <B>Sanjiv Das: A Short Academic Life History</B> <p>\n\nAfter loafing and working in many parts of Asia, but never really\ngrowing up, Sanjiv moved to New York to change the world, hopefully\nthrough research.  He graduated in 1994 with a Ph.D. from NYU, and\nsince then spent five years in Boston, and now lives in San Jose,\nCalifornia.  Sanjiv loves animals, places in the world where the\nmountains meet the sea, riding sport motorbikes, reading, gadgets,\nscience fiction movies, and writing cool software code. 
When there is\ntime available from the excitement of daily life, Sanjiv writes\nacademic papers, which helps him relax. Always the contrarian, Sanjiv\nthinks that New York City is the most calming place in the world,\nafter California of course.\n\n<p>\n\nSanjiv is now a Professor of Finance at Santa Clara University. He came\nto SCU from Harvard Business School and spent a year at UC Berkeley. In\nhis past life in the unreal world, Sanjiv worked at Citibank, N.A. in\nthe Asia-Pacific region. He takes great pleasure in merging his many\nprevious lives into his current existence, which is incredibly confused\nand diverse.\n\n<p>\n\nSanjiv's research style is instilled with a distinct \"New York state of\nmind\" - it is chaotic, diverse, with minimal method to the madness. He\nhas published articles on derivatives, term-structure models, mutual\nfunds, the internet, portfolio choice, banking models, credit risk, and\nhas unpublished articles in many other areas. Some years ago, he took\ntime off to get another degree in computer science at Berkeley,\nconfirming that an unchecked hobby can quickly become an obsession.\nThere he learnt about the fascinating field of Randomized Algorithms,\nskills he now applies earnestly to his editorial work, and other\npursuits, many of which stem from being in the epicenter of Silicon\nValley.\n\n<p>\n\nCoastal living did a lot to mold Sanjiv, who needs to live near the\nocean.  The many walks in Greenwich village convinced him that there is\nno such thing as a representative investor, yet added many unique\nfeatures to his personal utility function. He learnt that it is\nimportant to open the academic door to the ivory tower and let the world\nin. Academia is a real challenge, given that he has to reconcile many\nmore opinions than ideas. He has been known to have turned down many\noffers from Mad magazine to publish his academic work. 
As he often\nexplains, you never really finish your education - \"you can check out\nany time you like, but you can never leave.\" Which is why he is doomed\nto a lifetime in Hotel California. And he believes that, if this is as\nbad as it gets, life is really pretty good.\n\n\n"
text = str_replace_all(text,"[<>{}()&;,.\n]"," ")
print(text)
## [1] " HTML   BODY background=\"http://algo scu edu/~sanjivdas/graphics/back2 gif\"   Sanjiv Das is the William and Janice Terry Professor of Finance at Santa Clara University's Leavey School of Business  He previously held faculty appointments as Associate Professor at Harvard Business School and UC Berkeley  He holds post-graduate degrees in Finance  M Phil and Ph D  from New York University   Computer Science  M S  from UC Berkeley   an MBA from the Indian Institute of Management  Ahmedabad  B Com in Accounting and Economics  University of Bombay  Sydenham College   and is also a qualified Cost and Works Accountant  He is a senior editor of The Journal of Investment Management  co-editor of The Journal of Derivatives and The Journal of Financial Services Research  and Associate Editor of other academic journals  Prior to being an academic  he worked in the derivatives business in the Asia-Pacific region as a Vice-President at Citibank  His current research interests include: the modeling of default risk  machine learning  social networks  derivatives pricing models  portfolio theory  and venture capital  He has published over eighty articles in academic journals  and has won numerous awards for research and teaching  His recent book \"Derivatives: Principles and Practice\" was published in May 2010   He currently also serves as a Senior Fellow at the FDIC Center for Financial Research     p   B Sanjiv Das: A Short Academic Life History /B   p   After loafing and working in many parts of Asia  but never really growing up  Sanjiv moved to New York to change the world  hopefully through research   He graduated in 1994 with a Ph D  from NYU  and since then spent five years in Boston  and now lives in San Jose  California   Sanjiv loves animals  places in the world where the mountains meet the sea  riding sport motorbikes  reading  gadgets  science fiction movies  and writing cool software code  When there is time available from the excitement of daily life  Sanjiv 
writes academic papers  which helps him relax  Always the contrarian  Sanjiv thinks that New York City is the most calming place in the world  after California of course    p   Sanjiv is now a Professor of Finance at Santa Clara University  He came to SCU from Harvard Business School and spent a year at UC Berkeley  In his past life in the unreal world  Sanjiv worked at Citibank  N A  in the Asia-Pacific region  He takes great pleasure in merging his many previous lives into his current existence  which is incredibly confused and diverse    p   Sanjiv's research style is instilled with a distinct \"New York state of mind\" - it is chaotic  diverse  with minimal method to the madness  He has published articles on derivatives  term-structure models  mutual funds  the internet  portfolio choice  banking models  credit risk  and has unpublished articles in many other areas  Some years ago  he took time off to get another degree in computer science at Berkeley  confirming that an unchecked hobby can quickly become an obsession  There he learnt about the fascinating field of Randomized Algorithms  skills he now applies earnestly to his editorial work  and other pursuits  many of which stem from being in the epicenter of Silicon Valley    p   Coastal living did a lot to mold Sanjiv  who needs to live near the ocean   The many walks in Greenwich village convinced him that there is no such thing as a representative investor  yet added many unique features to his personal utility function  He learnt that it is important to open the academic door to the ivory tower and let the world in  Academia is a real challenge  given that he has to reconcile many more opinions than ideas  He has been known to have turned down many offers from Mad magazine to publish his academic work  As he often explains  you never really finish your education - \"you can check out any time you like  but you can never leave \" Which is why he is doomed to a lifetime in Hotel California  And he believes 
that  if this is as bad as it gets  life is really pretty good    "
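
The output above still contains tag names (HTML, BODY, p) because only the bracket characters were replaced. One possible refinement, sketched here with base R gsub() on a made-up snippet, is to remove whole tags first and only then strip punctuation and collapse whitespace. This is an illustrative alternative, not the cleanup used in the talk, and a regular expression like this is only adequate for simple, well-behaved HTML.

```r
# Toy HTML snippet (hypothetical content)
raw <- paste("<HTML>",
             "<BODY background=\"x.gif\">",
             "<p> <B>A Short Title</B> <p>",
             "Some body text, with punctuation.",
             "</BODY>",
             sep = "\n")

# 1. Replace each complete tag, from "<" to the next ">", with a space
no_tags <- gsub("<[^>]+>", " ", raw)

# 2. Replace remaining punctuation with spaces
no_punct <- gsub("[[:punct:]]", " ", no_tags)

# 3. Collapse runs of whitespace (including newlines) and trim the ends
clean <- trimws(gsub("[[:space:]]+", " ", no_punct))
print(clean)
```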

XML Package

The XML package in R also comes with many functions that aid in cleaning up text and dropping it (mostly unformatted) into a flat file or data frame. This may then be further processed. Here is some example code for this.

Processing XML files in R into a data frame

The following example has been adapted from r-bloggers.com. It uses the following URL:

http://www.w3schools.com/xml/plant_catalog.xml

library(XML)
## Warning: package 'XML' was built under R version 3.2.4
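# Part 1: Read an XML file and create a data frame from it.

xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"
xmlfile <- xmlTreeParse(xml.url)   # parse the document into an R tree
xmltop <- xmlRoot(xmlfile)         # root node of the catalog
# For each plant node, collect the text value of every child field
plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
# Transpose so that each plant is a row and each field is a column
plantcat_df <- data.frame(t(plantcat), row.names = NULL)
plantcat_df[1:5, 1:4]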
##                COMMON              BOTANICAL ZONE        LIGHT
## 1           Bloodroot Sanguinaria canadensis    4 Mostly Shady
## 2           Columbine   Aquilegia canadensis    3 Mostly Shady
## 3      Marsh Marigold       Caltha palustris    4 Mostly Sunny
## 4             Cowslip       Caltha palustris    4 Mostly Shady
## 5 Dutchman's-Breeches    Dicentra cucullaria    3 Mostly Shady
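
The same idea can be tried offline by parsing XML held in a string. The sketch below assumes the XML package is installed and uses its xmlParse() and xmlToDataFrame() functions on a two-record catalog; the records reuse the first two plants shown above, trimmed to two fields for brevity.

```r
library(XML)

# A tiny catalog built as one string (two fields per plant, for brevity)
xml_text <- paste0(
  "<CATALOG>",
  "<PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>",
  "<PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>",
  "</CATALOG>")

# asText = TRUE tells the parser the input is XML content, not a file name
doc <- xmlParse(xml_text, asText = TRUE)

# One row per child of the root, one column per field
plants <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
print(plants)
```

xmlToDataFrame() suits flat, record-like XML such as this; more deeply nested documents generally need the node-by-node approach shown earlier.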

Creating an XML file from a data frame

#Example adapted from https://stat.ethz.ch/pipermail/r-help/2008-September/175364.html
#Load the iris data set and create a data frame
data("iris")
data <- as.data.frame(iris)

xml <- xmlTree()
xml$addTag("document", close=FALSE)
## Warning in xmlRoot.XMLInternalDocument(currentNodes[[1]]): empty XML
## document
for (i in 1:nrow(data)) {
  xml$addTag("row", close=FALSE)
  for (j in names(data)) {
    xml$addTag(j, data[i, j])
  }
  xml$closeTag()
}
xml$closeTag()

#view the xml
cat(saveXML(xml))
## <?xml version="1.0"?>
## 
## <document>
##   <row>
##     <Sepal.Length>5.1</Sepal.Length>
##     <Sepal.Width>3.5</Sepal.Width>
##     <Petal.Length>1.4</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.9</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>1.4</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.7</Sepal.Length>
##     <Sepal.Width>3.2</Sepal.Width>
##     <Petal.Length>1.3</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.6</Sepal.Length>
##     <Sepal.Width>3.1</Sepal.Width>
##     <Petal.Length>1.5</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5</Sepal.Length>
##     <Sepal.Width>3.6</Sepal.Width>
##     <Petal.Length>1.4</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.4</Sepal.Length>
##     <Sepal.Width>3.9</Sepal.Width>
##     <Petal.Length>1.7</Petal.Length>
##     <Petal.Width>0.4</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.6</Sepal.Length>
##     <Sepal.Width>3.4</Sepal.Width>
##     <Petal.Length>1.4</Petal.Length>
##     <Petal.Width>0.3</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5</Sepal.Length>
##     <Sepal.Width>3.4</Sepal.Width>
##     <Petal.Length>1.5</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.4</Sepal.Length>
##     <Sepal.Width>2.9</Sepal.Width>
##     <Petal.Length>1.4</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.9</Sepal.Length>
##     <Sepal.Width>3.1</Sepal.Width>
##     <Petal.Length>1.5</Petal.Length>
##     <Petal.Width>0.1</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.4</Sepal.Length>
##     <Sepal.Width>3.7</Sepal.Width>
##     <Petal.Length>1.5</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.8</Sepal.Length>
##     <Sepal.Width>3.4</Sepal.Width>
##     <Petal.Length>1.6</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.8</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>1.4</Petal.Length>
##     <Petal.Width>0.1</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.3</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>1.1</Petal.Length>
##     <Petal.Width>0.1</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.8</Sepal.Length>
##     <Sepal.Width>4</Sepal.Width>
##     <Petal.Length>1.2</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.7</Sepal.Length>
##     <Sepal.Width>4.4</Sepal.Width>
##     <Petal.Length>1.5</Petal.Length>
##     <Petal.Width>0.4</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.4</Sepal.Length>
##     <Sepal.Width>3.9</Sepal.Width>
##     <Petal.Length>1.3</Petal.Length>
##     <Petal.Width>0.4</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.1</Sepal.Length>
##     <Sepal.Width>3.5</Sepal.Width>
##     <Petal.Length>1.4</Petal.Length>
##     <Petal.Width>0.3</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.7</Sepal.Length>
##     <Sepal.Width>3.8</Sepal.Width>
##     <Petal.Length>1.7</Petal.Length>
##     <Petal.Width>0.3</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.1</Sepal.Length>
##     <Sepal.Width>3.8</Sepal.Width>
##     <Petal.Length>1.5</Petal.Length>
##     <Petal.Width>0.3</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.4</Sepal.Length>
##     <Sepal.Width>3.4</Sepal.Width>
##     <Petal.Length>1.7</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.1</Sepal.Length>
##     <Sepal.Width>3.7</Sepal.Width>
##     <Petal.Length>1.5</Petal.Length>
##     <Petal.Width>0.4</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.6</Sepal.Length>
##     <Sepal.Width>3.6</Sepal.Width>
##     <Petal.Length>1</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.1</Sepal.Length>
##     <Sepal.Width>3.3</Sepal.Width>
##     <Petal.Length>1.7</Petal.Length>
##     <Petal.Width>0.5</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.8</Sepal.Length>
##     <Sepal.Width>3.4</Sepal.Width>
##     <Petal.Length>1.9</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>1.6</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5</Sepal.Length>
##     <Sepal.Width>3.4</Sepal.Width>
##     <Petal.Length>1.6</Petal.Length>
##     <Petal.Width>0.4</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.2</Sepal.Length>
##     <Sepal.Width>3.5</Sepal.Width>
##     <Petal.Length>1.5</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.2</Sepal.Length>
##     <Sepal.Width>3.4</Sepal.Width>
##     <Petal.Length>1.4</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.7</Sepal.Length>
##     <Sepal.Width>3.2</Sepal.Width>
##     <Petal.Length>1.6</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.8</Sepal.Length>
##     <Sepal.Width>3.1</Sepal.Width>
##     <Petal.Length>1.6</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.4</Sepal.Length>
##     <Sepal.Width>3.4</Sepal.Width>
##     <Petal.Length>1.5</Petal.Length>
##     <Petal.Width>0.4</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.2</Sepal.Length>
##     <Sepal.Width>4.1</Sepal.Width>
##     <Petal.Length>1.5</Petal.Length>
##     <Petal.Width>0.1</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.5</Sepal.Length>
##     <Sepal.Width>4.2</Sepal.Width>
##     <Petal.Length>1.4</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.9</Sepal.Length>
##     <Sepal.Width>3.1</Sepal.Width>
##     <Petal.Length>1.5</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5</Sepal.Length>
##     <Sepal.Width>3.2</Sepal.Width>
##     <Petal.Length>1.2</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.5</Sepal.Length>
##     <Sepal.Width>3.5</Sepal.Width>
##     <Petal.Length>1.3</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.9</Sepal.Length>
##     <Sepal.Width>3.6</Sepal.Width>
##     <Petal.Length>1.4</Petal.Length>
##     <Petal.Width>0.1</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.4</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>1.3</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.1</Sepal.Length>
##     <Sepal.Width>3.4</Sepal.Width>
##     <Petal.Length>1.5</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5</Sepal.Length>
##     <Sepal.Width>3.5</Sepal.Width>
##     <Petal.Length>1.3</Petal.Length>
##     <Petal.Width>0.3</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.5</Sepal.Length>
##     <Sepal.Width>2.3</Sepal.Width>
##     <Petal.Length>1.3</Petal.Length>
##     <Petal.Width>0.3</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.4</Sepal.Length>
##     <Sepal.Width>3.2</Sepal.Width>
##     <Petal.Length>1.3</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5</Sepal.Length>
##     <Sepal.Width>3.5</Sepal.Width>
##     <Petal.Length>1.6</Petal.Length>
##     <Petal.Width>0.6</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.1</Sepal.Length>
##     <Sepal.Width>3.8</Sepal.Width>
##     <Petal.Length>1.9</Petal.Length>
##     <Petal.Width>0.4</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.8</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>1.4</Petal.Length>
##     <Petal.Width>0.3</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.1</Sepal.Length>
##     <Sepal.Width>3.8</Sepal.Width>
##     <Petal.Length>1.6</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.6</Sepal.Length>
##     <Sepal.Width>3.2</Sepal.Width>
##     <Petal.Length>1.4</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.3</Sepal.Length>
##     <Sepal.Width>3.7</Sepal.Width>
##     <Petal.Length>1.5</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>5</Sepal.Length>
##     <Sepal.Width>3.3</Sepal.Width>
##     <Petal.Length>1.4</Petal.Length>
##     <Petal.Width>0.2</Petal.Width>
##     <Species>setosa</Species>
##   </row>
##   <row>
##     <Sepal.Length>7</Sepal.Length>
##     <Sepal.Width>3.2</Sepal.Width>
##     <Petal.Length>4.7</Petal.Length>
##     <Petal.Width>1.4</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.4</Sepal.Length>
##     <Sepal.Width>3.2</Sepal.Width>
##     <Petal.Length>4.5</Petal.Length>
##     <Petal.Width>1.5</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.9</Sepal.Length>
##     <Sepal.Width>3.1</Sepal.Width>
##     <Petal.Length>4.9</Petal.Length>
##     <Petal.Width>1.5</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.5</Sepal.Length>
##     <Sepal.Width>2.3</Sepal.Width>
##     <Petal.Length>4</Petal.Length>
##     <Petal.Width>1.3</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.5</Sepal.Length>
##     <Sepal.Width>2.8</Sepal.Width>
##     <Petal.Length>4.6</Petal.Length>
##     <Petal.Width>1.5</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.7</Sepal.Length>
##     <Sepal.Width>2.8</Sepal.Width>
##     <Petal.Length>4.5</Petal.Length>
##     <Petal.Width>1.3</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.3</Sepal.Length>
##     <Sepal.Width>3.3</Sepal.Width>
##     <Petal.Length>4.7</Petal.Length>
##     <Petal.Width>1.6</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.9</Sepal.Length>
##     <Sepal.Width>2.4</Sepal.Width>
##     <Petal.Length>3.3</Petal.Length>
##     <Petal.Width>1</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.6</Sepal.Length>
##     <Sepal.Width>2.9</Sepal.Width>
##     <Petal.Length>4.6</Petal.Length>
##     <Petal.Width>1.3</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.2</Sepal.Length>
##     <Sepal.Width>2.7</Sepal.Width>
##     <Petal.Length>3.9</Petal.Length>
##     <Petal.Width>1.4</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5</Sepal.Length>
##     <Sepal.Width>2</Sepal.Width>
##     <Petal.Length>3.5</Petal.Length>
##     <Petal.Width>1</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.9</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>4.2</Petal.Length>
##     <Petal.Width>1.5</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6</Sepal.Length>
##     <Sepal.Width>2.2</Sepal.Width>
##     <Petal.Length>4</Petal.Length>
##     <Petal.Width>1</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.1</Sepal.Length>
##     <Sepal.Width>2.9</Sepal.Width>
##     <Petal.Length>4.7</Petal.Length>
##     <Petal.Width>1.4</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.6</Sepal.Length>
##     <Sepal.Width>2.9</Sepal.Width>
##     <Petal.Length>3.6</Petal.Length>
##     <Petal.Width>1.3</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.7</Sepal.Length>
##     <Sepal.Width>3.1</Sepal.Width>
##     <Petal.Length>4.4</Petal.Length>
##     <Petal.Width>1.4</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.6</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>4.5</Petal.Length>
##     <Petal.Width>1.5</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.8</Sepal.Length>
##     <Sepal.Width>2.7</Sepal.Width>
##     <Petal.Length>4.1</Petal.Length>
##     <Petal.Width>1</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.2</Sepal.Length>
##     <Sepal.Width>2.2</Sepal.Width>
##     <Petal.Length>4.5</Petal.Length>
##     <Petal.Width>1.5</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.6</Sepal.Length>
##     <Sepal.Width>2.5</Sepal.Width>
##     <Petal.Length>3.9</Petal.Length>
##     <Petal.Width>1.1</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.9</Sepal.Length>
##     <Sepal.Width>3.2</Sepal.Width>
##     <Petal.Length>4.8</Petal.Length>
##     <Petal.Width>1.8</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.1</Sepal.Length>
##     <Sepal.Width>2.8</Sepal.Width>
##     <Petal.Length>4</Petal.Length>
##     <Petal.Width>1.3</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.3</Sepal.Length>
##     <Sepal.Width>2.5</Sepal.Width>
##     <Petal.Length>4.9</Petal.Length>
##     <Petal.Width>1.5</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.1</Sepal.Length>
##     <Sepal.Width>2.8</Sepal.Width>
##     <Petal.Length>4.7</Petal.Length>
##     <Petal.Width>1.2</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.4</Sepal.Length>
##     <Sepal.Width>2.9</Sepal.Width>
##     <Petal.Length>4.3</Petal.Length>
##     <Petal.Width>1.3</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.6</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>4.4</Petal.Length>
##     <Petal.Width>1.4</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.8</Sepal.Length>
##     <Sepal.Width>2.8</Sepal.Width>
##     <Petal.Length>4.8</Petal.Length>
##     <Petal.Width>1.4</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.7</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>5</Petal.Length>
##     <Petal.Width>1.7</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6</Sepal.Length>
##     <Sepal.Width>2.9</Sepal.Width>
##     <Petal.Length>4.5</Petal.Length>
##     <Petal.Width>1.5</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.7</Sepal.Length>
##     <Sepal.Width>2.6</Sepal.Width>
##     <Petal.Length>3.5</Petal.Length>
##     <Petal.Width>1</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.5</Sepal.Length>
##     <Sepal.Width>2.4</Sepal.Width>
##     <Petal.Length>3.8</Petal.Length>
##     <Petal.Width>1.1</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.5</Sepal.Length>
##     <Sepal.Width>2.4</Sepal.Width>
##     <Petal.Length>3.7</Petal.Length>
##     <Petal.Width>1</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.8</Sepal.Length>
##     <Sepal.Width>2.7</Sepal.Width>
##     <Petal.Length>3.9</Petal.Length>
##     <Petal.Width>1.2</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6</Sepal.Length>
##     <Sepal.Width>2.7</Sepal.Width>
##     <Petal.Length>5.1</Petal.Length>
##     <Petal.Width>1.6</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.4</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>4.5</Petal.Length>
##     <Petal.Width>1.5</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6</Sepal.Length>
##     <Sepal.Width>3.4</Sepal.Width>
##     <Petal.Length>4.5</Petal.Length>
##     <Petal.Width>1.6</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.7</Sepal.Length>
##     <Sepal.Width>3.1</Sepal.Width>
##     <Petal.Length>4.7</Petal.Length>
##     <Petal.Width>1.5</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.3</Sepal.Length>
##     <Sepal.Width>2.3</Sepal.Width>
##     <Petal.Length>4.4</Petal.Length>
##     <Petal.Width>1.3</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.6</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>4.1</Petal.Length>
##     <Petal.Width>1.3</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.5</Sepal.Length>
##     <Sepal.Width>2.5</Sepal.Width>
##     <Petal.Length>4</Petal.Length>
##     <Petal.Width>1.3</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.5</Sepal.Length>
##     <Sepal.Width>2.6</Sepal.Width>
##     <Petal.Length>4.4</Petal.Length>
##     <Petal.Width>1.2</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.1</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>4.6</Petal.Length>
##     <Petal.Width>1.4</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.8</Sepal.Length>
##     <Sepal.Width>2.6</Sepal.Width>
##     <Petal.Length>4</Petal.Length>
##     <Petal.Width>1.2</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5</Sepal.Length>
##     <Sepal.Width>2.3</Sepal.Width>
##     <Petal.Length>3.3</Petal.Length>
##     <Petal.Width>1</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.6</Sepal.Length>
##     <Sepal.Width>2.7</Sepal.Width>
##     <Petal.Length>4.2</Petal.Length>
##     <Petal.Width>1.3</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.7</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>4.2</Petal.Length>
##     <Petal.Width>1.2</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.7</Sepal.Length>
##     <Sepal.Width>2.9</Sepal.Width>
##     <Petal.Length>4.2</Petal.Length>
##     <Petal.Width>1.3</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.2</Sepal.Length>
##     <Sepal.Width>2.9</Sepal.Width>
##     <Petal.Length>4.3</Petal.Length>
##     <Petal.Width>1.3</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.1</Sepal.Length>
##     <Sepal.Width>2.5</Sepal.Width>
##     <Petal.Length>3</Petal.Length>
##     <Petal.Width>1.1</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.7</Sepal.Length>
##     <Sepal.Width>2.8</Sepal.Width>
##     <Petal.Length>4.1</Petal.Length>
##     <Petal.Width>1.3</Petal.Width>
##     <Species>versicolor</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.3</Sepal.Length>
##     <Sepal.Width>3.3</Sepal.Width>
##     <Petal.Length>6</Petal.Length>
##     <Petal.Width>2.5</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.8</Sepal.Length>
##     <Sepal.Width>2.7</Sepal.Width>
##     <Petal.Length>5.1</Petal.Length>
##     <Petal.Width>1.9</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>7.1</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>5.9</Petal.Length>
##     <Petal.Width>2.1</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.3</Sepal.Length>
##     <Sepal.Width>2.9</Sepal.Width>
##     <Petal.Length>5.6</Petal.Length>
##     <Petal.Width>1.8</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.5</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>5.8</Petal.Length>
##     <Petal.Width>2.2</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>7.6</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>6.6</Petal.Length>
##     <Petal.Width>2.1</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>4.9</Sepal.Length>
##     <Sepal.Width>2.5</Sepal.Width>
##     <Petal.Length>4.5</Petal.Length>
##     <Petal.Width>1.7</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>7.3</Sepal.Length>
##     <Sepal.Width>2.9</Sepal.Width>
##     <Petal.Length>6.3</Petal.Length>
##     <Petal.Width>1.8</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.7</Sepal.Length>
##     <Sepal.Width>2.5</Sepal.Width>
##     <Petal.Length>5.8</Petal.Length>
##     <Petal.Width>1.8</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>7.2</Sepal.Length>
##     <Sepal.Width>3.6</Sepal.Width>
##     <Petal.Length>6.1</Petal.Length>
##     <Petal.Width>2.5</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.5</Sepal.Length>
##     <Sepal.Width>3.2</Sepal.Width>
##     <Petal.Length>5.1</Petal.Length>
##     <Petal.Width>2</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.4</Sepal.Length>
##     <Sepal.Width>2.7</Sepal.Width>
##     <Petal.Length>5.3</Petal.Length>
##     <Petal.Width>1.9</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.8</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>5.5</Petal.Length>
##     <Petal.Width>2.1</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.7</Sepal.Length>
##     <Sepal.Width>2.5</Sepal.Width>
##     <Petal.Length>5</Petal.Length>
##     <Petal.Width>2</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.8</Sepal.Length>
##     <Sepal.Width>2.8</Sepal.Width>
##     <Petal.Length>5.1</Petal.Length>
##     <Petal.Width>2.4</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.4</Sepal.Length>
##     <Sepal.Width>3.2</Sepal.Width>
##     <Petal.Length>5.3</Petal.Length>
##     <Petal.Width>2.3</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.5</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>5.5</Petal.Length>
##     <Petal.Width>1.8</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>7.7</Sepal.Length>
##     <Sepal.Width>3.8</Sepal.Width>
##     <Petal.Length>6.7</Petal.Length>
##     <Petal.Width>2.2</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>7.7</Sepal.Length>
##     <Sepal.Width>2.6</Sepal.Width>
##     <Petal.Length>6.9</Petal.Length>
##     <Petal.Width>2.3</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6</Sepal.Length>
##     <Sepal.Width>2.2</Sepal.Width>
##     <Petal.Length>5</Petal.Length>
##     <Petal.Width>1.5</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.9</Sepal.Length>
##     <Sepal.Width>3.2</Sepal.Width>
##     <Petal.Length>5.7</Petal.Length>
##     <Petal.Width>2.3</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.6</Sepal.Length>
##     <Sepal.Width>2.8</Sepal.Width>
##     <Petal.Length>4.9</Petal.Length>
##     <Petal.Width>2</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>7.7</Sepal.Length>
##     <Sepal.Width>2.8</Sepal.Width>
##     <Petal.Length>6.7</Petal.Length>
##     <Petal.Width>2</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.3</Sepal.Length>
##     <Sepal.Width>2.7</Sepal.Width>
##     <Petal.Length>4.9</Petal.Length>
##     <Petal.Width>1.8</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.7</Sepal.Length>
##     <Sepal.Width>3.3</Sepal.Width>
##     <Petal.Length>5.7</Petal.Length>
##     <Petal.Width>2.1</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>7.2</Sepal.Length>
##     <Sepal.Width>3.2</Sepal.Width>
##     <Petal.Length>6</Petal.Length>
##     <Petal.Width>1.8</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.2</Sepal.Length>
##     <Sepal.Width>2.8</Sepal.Width>
##     <Petal.Length>4.8</Petal.Length>
##     <Petal.Width>1.8</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.1</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>4.9</Petal.Length>
##     <Petal.Width>1.8</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.4</Sepal.Length>
##     <Sepal.Width>2.8</Sepal.Width>
##     <Petal.Length>5.6</Petal.Length>
##     <Petal.Width>2.1</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>7.2</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>5.8</Petal.Length>
##     <Petal.Width>1.6</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>7.4</Sepal.Length>
##     <Sepal.Width>2.8</Sepal.Width>
##     <Petal.Length>6.1</Petal.Length>
##     <Petal.Width>1.9</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>7.9</Sepal.Length>
##     <Sepal.Width>3.8</Sepal.Width>
##     <Petal.Length>6.4</Petal.Length>
##     <Petal.Width>2</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.4</Sepal.Length>
##     <Sepal.Width>2.8</Sepal.Width>
##     <Petal.Length>5.6</Petal.Length>
##     <Petal.Width>2.2</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.3</Sepal.Length>
##     <Sepal.Width>2.8</Sepal.Width>
##     <Petal.Length>5.1</Petal.Length>
##     <Petal.Width>1.5</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.1</Sepal.Length>
##     <Sepal.Width>2.6</Sepal.Width>
##     <Petal.Length>5.6</Petal.Length>
##     <Petal.Width>1.4</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>7.7</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>6.1</Petal.Length>
##     <Petal.Width>2.3</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.3</Sepal.Length>
##     <Sepal.Width>3.4</Sepal.Width>
##     <Petal.Length>5.6</Petal.Length>
##     <Petal.Width>2.4</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.4</Sepal.Length>
##     <Sepal.Width>3.1</Sepal.Width>
##     <Petal.Length>5.5</Petal.Length>
##     <Petal.Width>1.8</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>4.8</Petal.Length>
##     <Petal.Width>1.8</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.9</Sepal.Length>
##     <Sepal.Width>3.1</Sepal.Width>
##     <Petal.Length>5.4</Petal.Length>
##     <Petal.Width>2.1</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.7</Sepal.Length>
##     <Sepal.Width>3.1</Sepal.Width>
##     <Petal.Length>5.6</Petal.Length>
##     <Petal.Width>2.4</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.9</Sepal.Length>
##     <Sepal.Width>3.1</Sepal.Width>
##     <Petal.Length>5.1</Petal.Length>
##     <Petal.Width>2.3</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.8</Sepal.Length>
##     <Sepal.Width>2.7</Sepal.Width>
##     <Petal.Length>5.1</Petal.Length>
##     <Petal.Width>1.9</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.8</Sepal.Length>
##     <Sepal.Width>3.2</Sepal.Width>
##     <Petal.Length>5.9</Petal.Length>
##     <Petal.Width>2.3</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.7</Sepal.Length>
##     <Sepal.Width>3.3</Sepal.Width>
##     <Petal.Length>5.7</Petal.Length>
##     <Petal.Width>2.5</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.7</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>5.2</Petal.Length>
##     <Petal.Width>2.3</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.3</Sepal.Length>
##     <Sepal.Width>2.5</Sepal.Width>
##     <Petal.Length>5</Petal.Length>
##     <Petal.Width>1.9</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.5</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>5.2</Petal.Length>
##     <Petal.Width>2</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>6.2</Sepal.Length>
##     <Sepal.Width>3.4</Sepal.Width>
##     <Petal.Length>5.4</Petal.Length>
##     <Petal.Width>2.3</Petal.Width>
##     <Species>virginica</Species>
##   </row>
##   <row>
##     <Sepal.Length>5.9</Sepal.Length>
##     <Sepal.Width>3</Sepal.Width>
##     <Petal.Length>5.1</Petal.Length>
##     <Petal.Width>1.8</Petal.Width>
##     <Species>virginica</Species>
##   </row>
## </document>

The Response to News

Das, Martinez-Jerez, and Tufano (FM 2005)

Breakdown of News Flow

Frequency of Postings

Weekly Posting

Intraday Posting

Number of Characters per Posting

Text Handling

First, let’s read in a simple web page (my landing page). The readLines function returns a character vector with one element per line of the page:

text = readLines("http://srdas.github.io/")
print(text[1:4])
## [1] "<html>"                                          
## [2] ""                                                
## [3] "<head>"                                          
## [4] "<title>SCU Web Page of Sanjiv Ranjan Das</title>"
print(length(text))
## [1] 36
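
Because readLines returns one element per line, a few base R calls are enough to summarize what was read. The sketch below is only illustrative: it assumes a small hypothetical page held in memory through textConnection instead of a live URL, so it runs offline, and the names page, page_lines, and one_string are made up for this example.

```r
# Hypothetical three-line page, used in place of a live URL
page = c("<html>", "<title>Example Page</title>", "</html>")

# readLines accepts a connection as well as a URL or file name
con = textConnection(page)
page_lines = readLines(con)
close(con)

n_lines = length(page_lines)                     # number of lines read
n_chars = sum(nchar(page_lines))                 # total characters over all lines
one_string = paste(page_lines, collapse = " ")   # whole page as a single string

print(n_lines)
print(n_chars)
print(one_string)
```

Collapsing the lines into one string is convenient when a pattern may span a line break; keeping them as a vector is convenient when lines are to be filtered one by one.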

String Detection

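String handling is a basic need. Base R covers much of it (substr, regexpr), and the stringr package offers a more consistent set of functions such as str_locate; both are used below.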

#EXTRACTING SUBSTRINGS (take some time to look at
#the "stringr" package also)
library(stringr)
substr(text[4],24,29)
## [1] "Sanjiv"
#IF YOU WANT TO LOCATE A STRING
res = regexpr("Sanjiv",text[4])
print(res)
## [1] 24
## attr(,"match.length")
## [1] 6
## attr(,"useBytes")
## [1] TRUE
print(substr(text[4],res[1],res[1]+nchar("Sanjiv")-1))
## [1] "Sanjiv"
#ANOTHER WAY
res = str_locate(text[4],"Sanjiv")
print(res)
##      start end
## [1,]    24  29
print(substr(text[4],res[1],res[2]))
## [1] "Sanjiv"
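
The calls above locate a string within a single line. To find which lines contain a pattern, and to pull out every match rather than the first, base R offers grepl, gregexpr, and regmatches; the stringr counterparts are str_detect and str_extract_all. The following is a minimal sketch on a hypothetical three-line vector (lines_in is an assumed name, not the page read earlier).

```r
# Hypothetical lines standing in for a downloaded page
lines_in = c("<title>SCU Web Page of Sanjiv Ranjan Das</title>",
             "<body>",
             "Sanjiv Das teaches finance; Das writes on text mining.")

# Which lines contain the pattern? grepl returns one logical value per line
has_das = grepl("Das", lines_in, fixed = TRUE)
print(which(has_das))

# Extract every match in each line; the result is a list with one element per line
hits = regmatches(lines_in, gregexpr("Das", lines_in, fixed = TRUE))

# Count the matches found in each line
n_hits = sapply(hits, length)
print(n_hits)
```

Setting fixed = TRUE treats the pattern as a literal string, which avoids surprises when the search text contains characters that are special in regular expressions.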

Cleaning Text

Next, we use regular expressions with the grep family of commands to clean up text. I will read in my research page as the input. The cleanup here is deliberately “ruthless”: markup is stripped out without trying to preserve the page’s structure.
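
As a warm-up, here is a small sketch of what such a cleanup can look like. It works on a few hypothetical lines, not the research page itself, and it is not necessarily the sequence of steps used in this talk: the tag pattern and the names raw, no_tags, and cleaned are assumptions made for illustration.

```r
# Hypothetical lines in the style of a raw HTML page
raw = c("<H2>BOOKS and MONOGRAPHS</H2>",
        "",
        "<LI><img src=\"graphics/cover.png\" width=\"50\">",
        "\"Derivatives: Principles and Practice\" (2010),")

# Remove anything that looks like an HTML tag:
# a "<", then any run of characters other than ">", then a ">"
no_tags = gsub("<[^>]*>", "", raw)

# Trim surrounding whitespace, then keep only lines that still hold a character
no_tags = trimws(no_tags)
cleaned = no_tags[grep(".", no_tags)]
print(cleaned)
```

The pattern removes tags that open and close on the same line; a tag split across two lines would survive, which is one reason to collapse the page into a single string before cleaning.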

#SIMPLE TEXT HANDLING
text = readLines("http://srdas.github.io/research.htm")
print(length(text))
## [1] 794
print(text)
##   [1] "<HTML>"                                                                                                                                                                                                                                                                                                          
##   [2] "<HEAD>"                                                                                                                                                                                                                                                                                                          
##   [3] "<TITLE>Research of Professor Sanjiv Ranjan Das</TITLE>"                                                                                                                                                                                                                                                          
##   [4] "<BASE HREF=\"http://srdas.github.io/\">"                                                                                                                                                                                                                                                                         
##   [5] "</HEAD>"                                                                                                                                                                                                                                                                                                         
##   [6] "<BODY background=\"http://srdas.github.io/graphics/back2.gif\">"                                                                                                                                                                                                                                                 
##   [7] ""                                                                                                                                                                                                                                                                                                                
##   [8] "<H2>BOOKS and MONOGRAPHS</H2>"                                                                                                                                                                                                                                                                                   
##   [9] ""                                                                                                                                                                                                                                                                                                                
##  [10] "<OL reversed>"                                                                                                                                                                                                                                                                                                   
##  [11] ""                                                                                                                                                                                                                                                                                                                
##  [12] "<LI><img src=\"graphics/DSTMAA.png\" width=\"50\" height=\"65\">"                                                                                                                                                                                                                                                
##  [13] "\"Data Science: Theories, Models, Algorithms, and Analytics\" (web book -- work in progress)"                                                                                                                                                                                                                    
##  [14] "<a href=\"http://srdas.github.io/Papers/DSA_Book.pdf\">Read here.</a>"                                                                                                                                                                                                                                           
##  [15] ""                                                                                                                                                                                                                                                                                                                
##  [16] ""                                                                                                                                                                                                                                                                                                                
##  [17] "<LI><img src=\"graphics/derbook_cover.png\" width=\"50\" height=\"65\">"                                                                                                                                                                                                                                         
##  [18] "\"Derivatives: Principles and Practice\" (2010),"                                                                                                                                                                                                                                                                
##  [19] "(Rangarajan Sundaram and Sanjiv Das), McGraw Hill."                                                                                                                                                                                                                                                              
##  [20] "<a href=\"http://www.amazon.com/Derivatives-Rangarajan-Sundaram/dp/0072949317/ref=sr_1_1?ie=UTF8&s=books&qid=1268798971&sr=8-1\">[Amazon]</a>"                                                                                                                                                                   
##  [21] "<a href=\"http://productsearch.barnesandnoble.com/search/results.aspx?WRD=sundaram+das\">[BarnesNoble]</a>"                                                                                                                                                                                                      
##  [22] ""                                                                                                                                                                                                                                                                                                                
##  [23] "</OL>"                                                                                                                                                                                                                                                                                                           
##  [24] ""                                                                                                                                                                                                                                                                                                                
##  [25] "<H2>REFEREED JOURNAL PUBLICATIONS</H2>"                                                                                                                                                                                                                                                                          
##  [26] ""                                                                                                                                                                                                                                                                                                                
##  [27] "<OL reversed>"                                                                                                                                                                                                                                                                                                   
##  [28] ""                                                                                                                                                                                                                                                                                                                
##  [29] "<LI><img src=\"graphics/JBF_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
##  [30] "\"An Index-Based Measure of Liquidity,'' (with George Chacko and Rong Fan), (2016)."                                                                                                                                                                                                                             
##  [31] "Forthcoming, <I>Journal of Banking and Finance</I>."                                                                                                                                                                                                                                                             
##  [32] "<br>[<I> [Develops a new measure of liquidity for all sectors of the markets using ETFs. "                                                                                                                                                                                                                       
##  [33] "This paper won the S&P SPIVA 2012 Award for innovation of an index.</I>]"                                                                                                                                                                                                                                        
##  [34] "<a href=\"Papers/etfliq.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
##  [35] "</LI>"                                                                                                                                                                                                                                                                                                           
##  [36] ""                                                                                                                                                                                                                                                                                                                
##  [37] "<LI><img src=\"graphics/JAI.png\" width=\"55\" height=\"40\">"                                                                                                                                                                                                                                                   
##  [38] "\"Matrix Metrics: Network-Based Systemic Risk Scoring\", (2016)."                                                                                                                                                                                                                                                
##  [39] "<I>Journal of Alternative Investments</I>, Special Issue on Systemic Risk, v18(4), 33-51."                                                                                                                                                                                                                       
##  [40] "<br>[<I>A new approach to identifying system-wide financial risk, SIFIs, and several other measures"                                                                                                                                                                                                             
##  [41] "of systemic risk. This paper won the First Prize in the MIT-CFP competition 2016 for "                                                                                                                                                                                                                           
##  [42] "the best paper on SIFIs (systemically important financial institutions). "                                                                                                                                                                                                                                       
##  [43] "It also won the best paper award at "                                                                                                                                                                                                                                                                            
##  [44] "the R Finance conference, Chicago 2015. </I>]"                                                                                                                                                                                                                                                                   
##  [45] "<a href=\"Papers/JAI_Das_issue.pdf\">[PDF of paper]</a>"                                                                                                                                                                                                                                                         
##  [46] "<a href=\"Papers/JAI_EditorsLetter_issue.pdf\">[Editor's letter re Special Issue]</a>"                                                                                                                                                                                                                           
##  [47] "<a href=\"Papers/JAI_Getmansky_Stein_issue.pdf\">[Editor's overview]</a>"                                                                                                                                                                                                                                        
##  [48] "<a href=\"Papers/RiskNetworks_slides_RFinance_2015_05.pdf\">[SLIDES RFinance]</a>. "                                                                                                                                                                                                                             
##  [49] "</LI>"                                                                                                                                                                                                                                                                                                           
##  [50] ""                                                                                                                                                                                                                                                                                                                
##  [51] ""                                                                                                                                                                                                                                                                                                                
##  [52] ""                                                                                                                                                                                                                                                                                                                
##  [53] ""                                                                                                                                                                                                                                                                                                                
##  [54] "<LI><img src=\"graphics/JBF_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
##  [55] "\"Credit Spreads with Dynamic Debt\" (with Seoyoung Kim), (2015), "                                                                                                                                                                                                                                              
##  [56] "<I>Journal of Banking and Finance</I>, v50, 121-140."                                                                                                                                                                                                                                                            
##  [57] "<a href=\"Papers/DasKim_JBF2015_FINAL.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                           
##  [58] "<br>[<I>Extends the Merton risky debt model from static debt to dynamic debt"                                                                                                                                                                                                                                    
##  [59] "and generates credit spread term structures that are closer to those in the data</I>]"                                                                                                                                                                                                                           
##  [60] "</LI>"                                                                                                                                                                                                                                                                                                           
##  [61] ""                                                                                                                                                                                                                                                                                                                
##  [62] "<LI><img src=\"graphics/FTF.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                                   
##  [63] "\"Text and Context: Language Analytics for Finance\", (2014),"                                                                                                                                                                                                                                                   
##  [64] "<I>Foundations and Trends in Finance</I>, v8(3), 145-260. "                                                                                                                                                                                                                                                      
##  [65] "<a href=\"Papers/Das_TextAnalyticsInFinance.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                     
##  [66] "<br>[<I>A comprehensive survey of comcepts, tools, techniques, and empirical "                                                                                                                                                                                                                                   
##  [67] "literature on textual processing in finance.</I>]"                                                                                                                                                                                                                                                               
##  [68] ""                                                                                                                                                                                                                                                                                                                
##  [69] ""                                                                                                                                                                                                                                                                                                                
##  [70] "<LI><img src=\"graphics/jfe.gif\" width=\"40\" height=\"55\">\"Did CDS Trading Improve the Market for Corporate Bonds?\" (with Madhu Kalimipalli and Subhankar Nayak), (2014), <I>Journal of Financial Economics</I> 111, 495-525."                                                                              
##  [71] "<a href=\"Papers/cdsbondeff.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                     
##  [72] "<br>[<I>The inception of CDS trading in a reference name renders its bonds less efficient, with no improvement in market quality or liquidity</I>]"                                                                                                                                                              
##  [73] "</LI>"                                                                                                                                                                                                                                                                                                           
##  [74] ""                                                                                                                                                                                                                                                                                                                
##  [75] "<LI><img src=\"graphics/JBF_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
##  [76] "\"Strategic Loan Modification: An Options-Based Response to Strategic Default,\""                                                                                                                                                                                                                                
##  [77] "(with Ray Meadows), (2013), <I>Journal of Banking and Finance</I> 37, 636-647. "                                                                                                                                                                                                                                 
##  [78] "<a href=\"Papers/sam.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                            
##  [79] "<br>[<I>A closed-form solution for mortgage debt with default and optimal loan modificatoin thereon.</I>]"                                                                                                                                                                                                       
##  [80] "</LI>"                                                                                                                                                                                                                                                                                                           
##  [81] ""                                                                                                                                                                                                                                                                                                                
##  [82] ""                                                                                                                                                                                                                                                                                                                
##  [83] "<LI><img src=\"graphics/JEDC_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
##  [84] "\"Options and Structured Products in Behavioral Portfolios,\" (with Meir Statman), (2013), "                                                                                                                                                                                                                     
##  [85] "<I>Journal of Economic Dynamics and Control</I>, 37(1), 137-153."                                                                                                                                                                                                                                                
##  [86] "<a href=\"Papers/JEDC_FINAL_PROOF.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                               
##  [87] "<br>[<I>Explores the roles in behavioral portfolios of option collars, capital guaranteed notes, "                                                                                                                                                                                                               
##  [88] "and barrier range notes, in the presence of fat-tailed outcomes using copulas."                                                                                                                                                                                                                                  
##  [89] "</I>]"                                                                                                                                                                                                                                                                                                           
##  [90] "</LI>"                                                                                                                                                                                                                                                                                                           
##  [91] ""                                                                                                                                                                                                                                                                                                                
##  [92] "<LI><img src=\"graphics/JFQA_cover.jpg\" width=\"40\" height=\"55\"> "                                                                                                                                                                                                                                           
##  [93] "\"The Principal Principle,\" (2012), <I>Journal of Financial and QuantitativeAnalysis</I>, 47(6), 1215-1246.  "                                                                                                                                                                                                  
##  [94] "<a href=\"http://journals.cambridge.org/repo_A884JKBk\">[PDF]</a>"                                                                                                                                                                                                                                               
##  [95] "<br>[<I>Optimal approaches for mortgage loan modification. Principal reduction is optimal, and better than rate reductions, maturity extensions, and principal forebearance. Shared-appreciation mortgages solve moral hazard.</I>]"                                                                             
##  [96] "</LI>"                                                                                                                                                                                                                                                                                                           
##  [97] ""                                                                                                                                                                                                                                                                                                                
##  [98] "<LI><img src=\"graphics/IEEE.gif\" width=\"40\" height=\"55\"> "                                                                                                                                                                                                                                                 
##  [99] "\"Extracting, Linking and Integrating Data from Public Sources: A Financial Case Study,\" (2011), (with Douglas Burdick, Mauricio A. Hernandez, Howard Ho, Georgia Koutrika, Rajasekar Krishnamurthy, Lucian Popa, Ioana Stanoi, Shivakumar Vaithyanathan), <I>IEEE Data Engineering Bulletin</I>, 34(3), 60-67."
## [100] "<a href=\"Papers/midaswww2011_FINAL.pdf\">[PDF older version]</a>"                                                                                                                                                                                                                                               
## [101] "<a href=\"Papers/midas-deb_July2011.pdf\">[PDF final version]</a>"                                                                                                                                                                                                                                               
## [102] ""                                                                                                                                                                                                                                                                                                                
## [103] "<LI><img src=\"graphics/jfint_cover.gif\" width=\"40\" height=\"55\"> "                                                                                                                                                                                                                                          
## [104] "\"Polishing Diamonds in the Rough: The Sources of Syndicated Venture Performance,\" (2011), (with Hoje Jo and Yongtae Kim), "                                                                                                                                                                                    
## [105] "<I>Journal of Financial Intermediation</I> 20(2), 199--230."                                                                                                                                                                                                                                                     
## [106] "<a href=\"Papers/synd.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                           
## [107] "<br>[<I>Syndicate-financed firms fare better---higher return multiples come from better selection, but time-to-exit and likelihood of exit are better on accont of superior monitoring by the syndicate.</I>]"                                                                                                   
## [108] "</LI>"                                                                                                                                                                                                                                                                                                           
## [109] ""                                                                                                                                                                                                                                                                                                                
## [110] "<LI><img src=\"graphics/JFQA_cover.jpg\" width=\"40\" height=\"55\"> \"Portfolio"                                                                                                                                                                                                                                
## [111] "Optimization with Mental Accounts,\" (2010), (with Harry Markowitz, Jonathan"                                                                                                                                                                                                                                    
## [112] "Scheid, and Meir Statman),  <I>Journal of Financial and Quantitative"                                                                                                                                                                                                                                            
## [113] "Analysis</I>, v45(2), 311-334."                                                                                                                                                                                                                                                                                  
## [114] "<a href=\"http://journals.cambridge.org/repo_A772rEdS\">[PDF (copyright: Cambridge University Press)]</a>"                                                                                                                                                                                                       
## [115] "<br>[<I>Mean-variance optimization is reconciled with behavioral porfolio theory. Mental "                                                                                                                                                                                                                       
## [116] "accounts optimization leads to better aggregate portfolios.</I>]"                                                                                                                                                                                                                                                
## [117] "</LI>"                                                                                                                                                                                                                                                                                                           
## [118] ""                                                                                                                                                                                                                                                                                                                
## [119] "<LI><img src=\"graphics/jcr.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                                   
## [120] "\"The Long and Short of it: Why are stocks with shorter run-lengths preferred?\" (2010), (with Priya Raghubir), <I>Journal of Consumer Research</I>. 36(6), 964-982."                                                                                                                                            
## [121] "<a href=\"Papers/runlength.pdf\">[PDF]</a>, "                                                                                                                                                                                                                                                                    
## [122] "<a href=\"Papers/runlength_summary.pdf\">[Non-technical summary]</a>"                                                                                                                                                                                                                                            
## [123] "<br>[<I>People responding to stock charts are systematically biased against stocks with longer run lengths, even if these stocks are no riskier than those with shorter runs.</I>]"                                                                                                                              
## [124] "</LI>"                                                                                                                                                                                                                                                                                                           
## [125] ""                                                                                                                                                                                                                                                                                                                
## [126] ""                                                                                                                                                                                                                                                                                                                
## [127] "<LI><img src=\"graphics/anor.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                                  
## [128] "\"Run Lengths and Liquidity,\" (with Paul Hanouna), (2010), <I>Annals of Operations Resarch</I>, Special Issue on Risk and Uncertainty, 176(1), 127-152."                                                                                                                                                        
## [129] "<a href=\"Papers/rs.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                             
## [130] "<br>[<I>The run signature of a stock is shown to be mathematically related to liquidity. Runs are "                                                                                                                                                                                                              
## [131] "priced factors. </I>]"                                                                                                                                                                                                                                                                                           
## [132] "</LI>"                                                                                                                                                                                                                                                                                                           
## [133] ""                                                                                                                                                                                                                                                                                                                
## [134] ""                                                                                                                                                                                                                                                                                                                
## [135] ""                                                                                                                                                                                                                                                                                                                
## [136] "<LI><img src=\"graphics/JEDC_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [137] "\"Implied Recovery,'' (with Paul Hanouna), (2009), <I>Journal of Economic Dynamics and Control</I>, 33(11), 1837-1857."                                                                                                                                                                                          
## [138] "<a href=\"Papers/imprec.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [139] "<br>[<I>How to use the term structure of CDS spreads to jointly identify the term structures of forward default probability and recovery rates.  </I>]"                                                                                                                                                          
## [140] "</LI>"                                                                                                                                                                                                                                                                                                           
## [141] ""                                                                                                                                                                                                                                                                                                                
## [142] ""                                                                                                                                                                                                                                                                                                                
## [143] "<LI><img src=\"graphics/JBF_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [144] "\"Accounting-based versus market-based cross-sectional models of CDS spreads,\" "                                                                                                                                                                                                                                
## [145] "(with Paul Hanouna and Atulya Sarin), (2009), "                                                                                                                                                                                                                                                                  
## [146] "<I>Journal of Banking and Finance</I>, 33, 719-730.  "                                                                                                                                                                                                                                                           
## [147] "<a href=\"Papers/JBF_final_3.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                    
## [148] "<br>[<I>Accounting models explain spreads as well as market-based ones, but a hybrid mix does best.</I>]"                                                                                                                                                                                                        
## [149] "</LI>"                                                                                                                                                                                                                                                                                                           
## [150] ""                                                                                                                                                                                                                                                                                                                
## [151] ""                                                                                                                                                                                                                                                                                                                
## [152] "<LI><img src=\"graphics/jfint_cover.gif\" width=\"40\" height=\"55\"> "                                                                                                                                                                                                                                          
## [153] "\"Hedging Credit: Equity Liquidity Matters,\" (with Paul Hanouna), (2009),"                                                                                                                                                                                                                                      
## [154] "<I>Journal of Financial Intermediation</I>, v18(1), 112-123"                                                                                                                                                                                                                                                     
## [155] "<a href=\"Papers/cdsliq.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [156] "<br>[<I>Hedging in CDS markets provides a mechanism by which equity market liquidity impacts CDS spreads </I>]"                                                                                                                                                                                                  
## [157] "</LI>"                                                                                                                                                                                                                                                                                                           
## [158] ""                                                                                                                                                                                                                                                                                                                
## [159] "<LI><img src=\"graphics/MS_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                              
## [160] "\"An Integrated Model for Hybrid Securities,\""                                                                                                                                                                                                                                                                  
## [161] "(with Raghu Sundaram), (2007), <I>Management Science</I>, v53, 1439-1451."                                                                                                                                                                                                                                       
## [162] "<a href=\"Papers/rsx_FINAL.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                      
## [163] "<br>[<I>A general flexible model for pricing derivative securities that depend on equity, "                                                                                                                                                                                                                      
## [164] "interest rate and credit risk, using observables. Delivers dynamic implied default probabilities.</I>]"                                                                                                                                                                                                          
## [165] "</LI>"                                                                                                                                                                                                                                                                                                           
## [166] ""                                                                                                                                                                                                                                                                                                                
## [167] "<LI><img src=\"graphics/MS_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                              
## [168] "\"Yahoo for Amazon! Sentiment Extraction from Small Talk on the Web,\""                                                                                                                                                                                                                                          
## [169] "(with Mike Chen), (2007), <I>Management Science</I>, v53, 1375-1388."                                                                                                                                                                                                                                            
## [170] "<a href=\"Papers/chat_FINAL.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                     
## [171] "<br>[<I>A methodology for parsing internet stock chat to develop a sentiment index. Assesses"                                                                                                                                                                                                                    
## [172] "whether small traders opinions contain information not in prices. </I>]"                                                                                                                                                                                                                                         
## [173] "</LI>"                                                                                                                                                                                                                                                                                                           
## [174] ""                                                                                                                                                                                                                                                                                                                
## [175] "<LI><img src=\"graphics/JF_cover.jpg\" width=\"120\" height=\"55\">"                                                                                                                                                                                                                                             
## [176] "\"Common Failings: How Corporate Defaults are Correlated\" "                                                                                                                                                                                                                                                     
## [177] "(with Darrell Duffie, Nikunj Kapadia and Leandro Saita)."                                                                                                                                                                                                                                                        
## [178] "(2007) <I>Journal of Finance</I>, v62, 93-117. "                                                                                                                                                                                                                                                                 
## [179] "<a href=\"Papers/ddks.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                           
## [180] "<br>[<I>New approach to test for defaul contagion using a stochastic time change. "                                                                                                                                                                                                                              
## [181] "Doubly stochastic models are refuted by the data.</I>]"                                                                                                                                                                                                                                                          
## [182] "</LI>"                                                                                                                                                                                                                                                                                                           
## [183] ""                                                                                                                                                                                                                                                                                                                
## [184] "<LI><img src=\"graphics/fmalogo_main.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                          
## [185] "\"A Clinical Study of Investor Discussion and Sentiment,\" "                                                                                                                                                                                                                                                     
## [186] "(with Asis Martinez-Jerez and Peter Tufano), 2005, "                                                                                                                                                                                                                                                             
## [187] "<I>Financial Management</I>, v34(5), 103-137."                                                                                                                                                                                                                                                                   
## [188] "<a href=\"Papers/einfo.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                          
## [189] "<br>[<I>Examines the interaction of chat room information and news. </I>]"                                                                                                                                                                                                                                       
## [190] "</LI>"                                                                                                                                                                                                                                                                                                           
## [191] ""                                                                                                                                                                                                                                                                                                                
## [192] "<LI><img src=\"graphics/JF_cover.jpg\" width=\"120\" height=\"55\">"                                                                                                                                                                                                                                             
## [193] "\"International Portfolio Choice with Systemic Risk,\""                                                                                                                                                                                                                                                          
## [194] "(with Raman Uppal), 2004, <I>Journal of Finance</I>, v59(6), 2809-2834."                                                                                                                                                                                                                                         
## [195] "<a href=\"Papers/systemic.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                       
## [196] "<br>[<I>A model for portfolio optimization with systemic risk. "                                                                                                                                                                                                                                                 
## [197] "The loss resulting from diminished diversification is small, while"                                                                                                                                                                                                                                              
## [198] "that from holding very highly levered positions is large. </I>]"                                                                                                                                                                                                                                                 
## [199] "</LI>"                                                                                                                                                                                                                                                                                                           
## [200] ""                                                                                                                                                                                                                                                                                                                
## [201] "<LI><img src=\"graphics/RFS_cover.gif\" width=\"40\" height=\"55\"> \"Fee"                                                                                                                                                                                                                                       
## [202] "Speech: Signaling, Risk-sharing and the Impact of Fee Structures on"                                                                                                                                                                                                                                             
## [203] "Investor Welfare,'' (with Rangarajan Sundaram), 2002, <i>Review of"                                                                                                                                                                                                                                              
## [204] "Financial Studies</i>, v15, 1465-1497."                                                                                                                                                                                                                                                                          
## [205] "<a href=\"Papers/fees.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                           
## [206] "<br><I>[Compares fulcrum vs incentive fees structures from the standpoint of "                                                                                                                                                                                                                                   
## [207] "investor welfare. Contrary to regulatory intuition, incentive structures"                                                                                                                                                                                                                                        
## [208] "are often optimal.] </I>"                                                                                                                                                                                                                                                                                        
## [209] "</LI>"                                                                                                                                                                                                                                                                                                           
## [210] ""                                                                                                                                                                                                                                                                                                                
## [211] "<LI><img src=\"graphics/FAJ_cover.gif\" width=\"140\" height=\"55\">"                                                                                                                                                                                                                                            
## [212] "\"A Discrete-Time Approach to No-arbitrage Pricing of Credit derivatives"                                                                                                                                                                                                                                        
## [213] "with Rating Transitions,\" (with Viral Acharya and Rangarajan Sundaram),"                                                                                                                                                                                                                                        
## [214] "2002, <I>Financial Analysts Journal</I>, May-June, 28-44."                                                                                                                                                                                                                                                       
## [215] "<a href=\"Papers/dsmarkov.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                       
## [216] "<br><I>[A HJM type two-factor model in risk free rates and spreads that also accounts "                                                                                                                                                                                                                          
## [217] "for rating transitions, allowing seamless pricing of many credit derivatives. ] </I>"                                                                                                                                                                                                                            
## [218] "</LI>"                                                                                                                                                                                                                                                                                                           
## [219] ""                                                                                                                                                                                                                                                                                                                
## [220] "<LI><img src=\"graphics/JOE_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [221] "\"The Surprise Element: Jumps in Interest Rates\", 2002, <I>Journal of"                                                                                                                                                                                                                                          
## [222] "Econometrics</I>, v106, 27-65."                                                                                                                                                                                                                                                                                  
## [223] "<a href=\"Papers/jump.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                           
## [224] "<br><I>[Estimation methodology for interest rates with jumps. A flexible "                                                                                                                                                                                                                                       
## [225] "specification that accommodates Federal Reserve Activity.]</I>"                                                                                                                                                                                                                                                  
## [226] "</LI>"                                                                                                                                                                                                                                                                                                           
## [227] ""                                                                                                                                                                                                                                                                                                                
## [228] "<LI><img src=\"graphics/RFS_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [229] "\"Pricing Interest Rate Derivatives: A General Approach,''(with George Chacko),"                                                                                                                                                                                                                                 
## [230] "  2002, <I>Review of Financial Studies</I>, v15(1), 195-241."                                                                                                                                                                                                                                                    
## [231] "<a href=\"Papers/affine.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [232] "<br><I>[General affine option pricing for interest rate derivatives covering a "                                                                                                                                                                                                                                 
## [233] "wide range of securities, allowing for M factors with N diffusions and L jumps.] </I>"                                                                                                                                                                                                                           
## [234] "</LI>"                                                                                                                                                                                                                                                                                                           
## [235] ""                                                                                                                                                                                                                                                                                                                
## [236] "<LI><img src=\"graphics/MS_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                              
## [237] "\"A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "                                                                                                                                                                                                                                  
## [238] "(with Rangarajan Sundaram), 2000, <I>Management Science</I>, v46(1), 46-62."                                                                                                                                                                                                                                     
## [239] "<a href=\"msfinal.ps\">[PS]</a>"                                                                                                                                                                                                                                                                                 
## [240] "<br><I>[HJM style two factor model for credit risk. ] </I>"                                                                                                                                                                                                                                                      
## [241] "</LI>"                                                                                                                                                                                                                                                                                                           
## [242] ""                                                                                                                                                                                                                                                                                                                
## [243] "<LI><img src=\"graphics/FAJ_cover.gif\" width=\"140\" height=\"55\">"                                                                                                                                                                                                                                            
## [244] "\"The Psychology of Financial Decision Making: A Case"                                                                                                                                                                                                                                                           
## [245] "for Theory-Driven Experimental Enquiry,''"                                                                                                                                                                                                                                                                       
## [246] "1999, (with Priya Raghubir),"                                                                                                                                                                                                                                                                                    
## [247] "<I>Financial Analyst's Journal</I>, Nov-Dec 1999, v55(6), 56-79."                                                                                                                                                                                                                                                
## [248] "<br><I>[Surveys the anomalies literature in Finance and shows how experimental"                                                                                                                                                                                                                                  
## [249] "studies may be used to disentangle competing hypotheses for the same anomaly.]</I>"                                                                                                                                                                                                                              
## [250] "</LI>"                                                                                                                                                                                                                                                                                                           
## [251] ""                                                                                                                                                                                                                                                                                                                
## [252] "<LI><img src=\"graphics/JFQA_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [253] "\"Of Smiles and Smirks: A Term Structure Perspective,''"                                                                                                                                                                                                                                                         
## [254] "1999, (with Rangarajan Sundaram), <I>Journal of"                                                                                                                                                                                                                                                                 
## [255] "Financial and Quantitative Analysis</I>, v34(2), 211-240."                                                                                                                                                                                                                                                       
## [256] "<a href=\"Papers/skew.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                           
## [257] "<br><I>[Explains how the shape of the volatility smile is determined by "                                                                                                                                                                                                                                        
## [258] "jumps and stochastic volatility. ]</I>"                                                                                                                                                                                                                                                                          
## [259] "</LI>"                                                                                                                                                                                                                                                                                                           
## [260] ""                                                                                                                                                                                                                                                                                                                
## [261] "<LI><img src=\"graphics/JBF_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [262] "\"A Theory of Banking Structure,\" 1999, (with Ashish Nanda),"                                                                                                                                                                                                                                                   
## [263] "<I>Journal of Banking and Finance</I>, v23(6), 863-895."                                                                                                                                                                                                                                                         
## [264] "<br><I>[A theory to analyze the specialization of banking activities based "                                                                                                                                                                                                                                     
## [265] "by function based upon two dimensions: the degree of information asymmetry "                                                                                                                                                                                                                                     
## [266] "and the degree of verifiability of the value of the service rendered. ]</I>"                                                                                                                                                                                                                                     
## [267] "</LI>"                                                                                                                                                                                                                                                                                                           
## [268] ""                                                                                                                                                                                                                                                                                                                
## [269] "<LI><img src=\"graphics/JEDC_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [270] "\"A Theory of Optimal Timing and Selectivity,'' "                                                                                                                                                                                                                                                                
## [271] "(with George Chacko), 1999, <I>Journal of"                                                                                                                                                                                                                                                                       
## [272] "Economic Dynamics and Control</I>, v23(7), 929-966."                                                                                                                                                                                                                                                             
## [273] "<br><I>[Dynamic optimal portfolio choice model for determining optimal effort"                                                                                                                                                                                                                                   
## [274] "allocation to timing and stock selection in asset allocation.]</I>"                                                                                                                                                                                                                                              
## [275] "</LI>"                                                                                                                                                                                                                                                                                                           
## [276] ""                                                                                                                                                                                                                                                                                                                
## [277] "<LI><img src=\"graphics/JEDC_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [278] "\"A Direct Discrete-Time Approach to"                                                                                                                                                                                                                                                                            
## [279] "Poisson-Gaussian Bond Option Pricing in the Heath-Jarrow-Morton "                                                                                                                                                                                                                                                
## [280] "Model,\" 1999, <I>Journal of Economic Dynamics and Control</I>, v23(3), 333-369."                                                                                                                                                                                                                                
## [281] "<br><I>[HJM tree with jumps. Fast, fully recombining dynamics. ] </I>"                                                                                                                                                                                                                                           
## [282] "</LI>"                                                                                                                                                                                                                                                                                                           
## [283] ""                                                                                                                                                                                                                                                                                                                
## [284] "<LI><img src=\"graphics/RESTAT_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                          
## [285] "\"The Central Tendency: A Second Factor in"                                                                                                                                                                                                                                                                      
## [286] "Bond Yields,\" 1998, (with Silverio Foresi and Pierluigi Balduzzi),  "                                                                                                                                                                                                                                           
## [287] "<I>The Review of Economics and Statistics</I>, v80(1), 60-72."                                                                                                                                                                                                                                                   
## [288] "<br><I>[Model of the term structure with stochastic long-run mean. Related to "                                                                                                                                                                                                                                  
## [289] "Federal Reserve acitivity.]</I>"                                                                                                                                                                                                                                                                                 
## [290] "<a href=\"Papers/BalduzziDasForesi_ReStat1998_CentralTendency.pdf\">[PDF]</a>"                                                                                                                                                                                                                                   
## [291] "</LI>"                                                                                                                                                                                                                                                                                                           
## [292] ""                                                                                                                                                                                                                                                                                                                
## [293] "<LI> <img src=\"graphics/RFS_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [294] "\"Efficiency with Costly Information: A Reinterpretation of"                                                                                                                                                                                                                                                     
## [295] "Evidence from Managed Portfolios,\" (with Edwin Elton, Martin Gruber and Matt "                                                                                                                                                                                                                                  
## [296] "Hlavka), <I>Review of Financial Studies</I>, vol. 6(1), 1993, pp 1-22. "                                                                                                                                                                                                                                         
## [297] "<a href=\"Papers/EGDH.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                           
## [298] "<br><I>[Mutual funds are not informationally efficient. "                                                                                                                                                                                                                                                        
## [299] "You are better off buying the index.]  </I>"                                                                                                                                                                                                                                                                     
## [300] "<br>"                                                                                                                                                                                                                                                                                                            
## [301] "Presented and Reprinted in the Proceedings of The "                                                                                                                                                                                                                                                              
## [302] "Seminar on the Analysis of Security Prices at the Center "                                                                                                                                                                                                                                                       
## [303] "for Research in Security   Prices  at the University of "                                                                                                                                                                                                                                                        
## [304] "Chicago, Graduate School of Business. </LI>"                                                                                                                                                                                                                                                                     
## [305] ""                                                                                                                                                                                                                                                                                                                
## [306] ""                                                                                                                                                                                                                                                                                                                
## [307] ""                                                                                                                                                                                                                                                                                                                
## [308] ""                                                                                                                                                                                                                                                                                                                
## [309] ""                                                                                                                                                                                                                                                                                                                
## [310] "<H2>MORE REFEREED JOURNAL PUBLICATIONS</H2>"                                                                                                                                                                                                                                                                     
## [311] ""                                                                                                                                                                                                                                                                                                                
## [312] ""                                                                                                                                                                                                                                                                                                                
## [313] "<LI><img src=\"graphics/jpm_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [314] "\"Coming up Short: Managing Underfunded Portfolios in an LDI-ES Framework\" (2014), "                                                                                                                                                                                                                            
## [315] "(with Seoyoung Kim and Meir Statman),  "                                                                                                                                                                                                                                                                         
## [316] "<I>Journal of Portfolio Management</I>, 41(1), 95-108."                                                                                                                                                                                                                                                          
## [317] "<a href=\"Papers/underfunded.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                    
## [318] "<br><I>[Provides a new definition of underfunded portfolios, and compares four remedies for underfunding.]</I>"                                                                                                                                                                                                  
## [319] "</LI>"                                                                                                                                                                                                                                                                                                           
## [320] ""                                                                                                                                                                                                                                                                                                                
## [321] ""                                                                                                                                                                                                                                                                                                                
## [322] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\"> "                                                                                                                                                                                                                                            
## [323] "\"Going for Broke: Restructuring Distressed Debt Portfolios\" (2014),"                                                                                                                                                                                                                                           
## [324] "(with Seoyoung Kim), <I>Journal of Fixed Income</I>, 24(3), 5-27."                                                                                                                                                                                                                                               
## [325] "<a href=\"Papers/ddo.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                            
## [326] "<br><I>[Optimizing portfolios where the return distributions of the assets is endogenous. The gains from restructuring distressed debt portfolios are large.]</I>"                                                                                                                                               
## [327] "</LI>"                                                                                                                                                                                                                                                                                                           
## [328] ""                                                                                                                                                                                                                                                                                                                
## [329] ""                                                                                                                                                                                                                                                                                                                
## [330] "<LI><img src=\"graphics/jpm_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [331] "\"Digital Portfolios.\" (2013), "                                                                                                                                                                                                                                                                                
## [332] "<I>Journal of Portfolio Management</I>, v39(2), 41-48."                                                                                                                                                                                                                                                          
## [333] "<a href=\"Papers/vport.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                          
## [334] "<br><I>[Constructing portfolios of assets with a binary payoff, large versus zero, and the differences in this optimization versus standard mean-variance portfolio construction.]</I>"                                                                                                                          
## [335] "</LI>"                                                                                                                                                                                                                                                                                                           
## [336] ""                                                                                                                                                                                                                                                                                                                
## [337] ""                                                                                                                                                                                                                                                                                                                
## [338] "<LI><img src=\"graphics/frl.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                                   
## [339] "\"Options on Portfolios with Higher-Order Moments,\" (2009),"                                                                                                                                                                                                                                                    
## [340] "(with Rishabh Bhandari), <I>Finance Research Letters</I>, v6, 122-129. "                                                                                                                                                                                                                                         
## [341] "<a href=\"Papers/tensor.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [342] "<br><I>[How to model fat-tailed portfolio distributions for "                                                                                                                                                                                                                                                    
## [343] "options on a multivariate system of assets, calibrated to the return "                                                                                                                                                                                                                                           
## [344] "means, covariance matrix, coskewness and cokurtosis tensors.]</I>"                                                                                                                                                                                                                                               
## [345] "</LI>"                                                                                                                                                                                                                                                                                                           
## [346] ""                                                                                                                                                                                                                                                                                                                
## [347] ""                                                                                                                                                                                                                                                                                                                
## [348] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [349] "\"Dealing with Dimension: Option Pricing on Factor Trees,\" (2009),"                                                                                                                                                                                                                                             
## [350] "(with Brian Granger), <I>Journal of Investment Management</I>, 7(2), 73-85."                                                                                                                                                                                                                                     
## [351] "<a href=\"Papers/faclat.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [352] "<br><I>[Multifactor representations of securities on high-dimensional trees. Allows "                                                                                                                                                                                                                            
## [353] "you to price options on multiple assets in a unified framework. Computational"
## [354] "results assess using multithreading.]</I>"                                                                                                                                                                                                                                                                       
## [355] "</LI>"                                                                                                                                                                                                                                                                                                           
## [356] ""                                                                                                                                                                                                                                                                                                                
## [357] ""                                                                                                                                                                                                                                                                                                                
## [358] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\"> "                                                                                                                                                                                                                                            
## [359] "\"Modeling"                                                                                                                                                                                                                                                                                                      
## [360] "Correlated Default with a Forest of Binomial Trees,\" (2007), (with"                                                                                                                                                                                                                                             
## [361] "Santhosh Bandreddi and Rong Fan), <I>Journal of Fixed"                                                                                                                                                                                                                                                           
## [362] "Income</I>. Winter, 1-20."                                                                                                                                                                                                                                                                                       
## [363] "<a href=\"Papers/bscorrdef.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                      
## [364] "<br><I>[Extends the Das-Sundaram hybrid securities model to correlated default modeling.  ]</I>"                                                                                                                                                                                                                 
## [365] "</LI>"                                                                                                                                                                                                                                                                                                           
## [366] ""                                                                                                                                                                                                                                                                                                                
## [367] "<LI><img src=\"graphics/jfsr_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [368] "\"Basel II: Correlation Related Issues\" (2007), "                                                                                                                                                                                                                                                               
## [369] "<I>Journal of Financial Services Research</I>, v32, 17-38."                                                                                                                                                                                                                                                      
## [370] "<a href=\"Papers/Das_JFSR2007_Basel2.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                            
## [371] "<br><I>[Analysis of correlation related issues arising in the implementation"                                                                                                                                                                                                                                    
## [372] "of the Basel II accord.]</I>"                                                                                                                                                                                                                                                                                    
## [373] "</LI>"                                                                                                                                                                                                                                                                                                           
## [374] ""                                                                                                                                                                                                                                                                                                                
## [375] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [376] "\"Correlated Default Risk,\" (2006),"                                                                                                                                                                                                                                                                            
## [377] "(with Laurence Freed, Gary Geng, and Nikunj Kapadia),"                                                                                                                                                                                                                                                           
## [378] "<I>Journal of Fixed Income</I>, Fall 2006, 7-32."                                                                                                                                                                                                                                                                
## [379] "<a href=\"Papers/DasFreedGengKapadia_JFI2006.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                    
## [380] "<br><I>[Empirical evidence on the nature of credit correlations. Correlations"                                                                                                                                                                                                                                   
## [381] "increase as markets worsen. Regime switching models are needed to explain dynamic"                                                                                                                                                                                                                               
## [382] "correlations.]</I>"                                                                                                                                                                                                                                                                                              
## [383] "</LI>"                                                                                                                                                                                                                                                                                                           
## [384] ""                                                                                                                                                                                                                                                                                                                
## [385] "<LI><img src=\"graphics/qfcover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                               
## [386] "\"A Simple Model for Pricing Equity Options with Markov"                                                                                                                                                                                                                                                         
## [387] "Switching State Variables\" (2006),"                                                                                                                                                                                                                                                                             
## [388] "(with Donald Aingworth and Rajeev Motwani),"                                                                                                                                                                                                                                                                     
## [389] "<I>Quantitative Finance</I>, v6(2), 95-105."                                                                                                                                                                                                                                                                     
## [390] "<a href=\"Papers/switch.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [391] "<br><I>[A tree model for options when the underlying has regime switches.]</I>"                                                                                                                                                                                                                                  
## [392] "</LI>"                                                                                                                                                                                                                                                                                                           
## [393] ""                                                                                                                                                                                                                                                                                                                
## [394] "<LI><img src=\"graphics/mktletters.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [395] "\"The Firm's Management of Social Interactions,\" (2005)"                                                                                                                                                                                                                                                        
## [396] "(with D. Godes, D. Mayzlin, Y. Chen, S. Das, C. Dellarocas, "                                                                                                                                                                                                                                                    
## [397] "B. Pfeiffer, B. Libai, S. Sen, M. Shi, and P. Verlegh). "
## [398] "<I>Marketing Letters</I>, v16, 415-428."
## [399] "<br><I>[A framework for how word-of-mouth communication is modeled in "                                                                                                                                                                                                                                          
## [400] "the practice of marketing.   ]</I>"                                                                                                                                                                                                                                                                              
## [401] "</LI>"                                                                                                                                                                                                                                                                                                           
## [402] ""                                                                                                                                                                                                                                                                                                                
## [403] "<LI><img src=\"graphics/jpm_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [404] "\"Financial Communities\" (with Jacob Sisk), 2005, "                                                                                                                                                                                                                                                             
## [405] "<i>Journal of Portfolio Management</i>, v31(4), "                                                                                                                                                                                                                                                                
## [406] "Summer, 112-123."                                                                                                                                                                                                                                                                                                
## [407] "<a href=\"Papers/fincom.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [408] "<br><I>[Applying graph theory to understanding investor networks to "                                                                                                                                                                                                                                            
## [409] "develop trading rules. ]</I>"                                                                                                                                                                                                                                                                                    
## [410] "</LI>"                                                                                                                                                                                                                                                                                                           
## [411] ""                                                                                                                                                                                                                                                                                                                
## [412] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [413] "\"Monte Carlo Markov Chain Methods for Derivative Pricing"                                                                                                                                                                                                                                                       
## [414] "and Risk Assessment,\"(with Alistair Sinclair), 2005, "                                                                                                                                                                                                                                                          
## [415] "<I>Journal of Investment Management</I>, v3(1), 29-44. "                                                                                                                                                                                                                                                         
## [416] "<a href=\"https://www.joim.com/ArticleContainer.asp?artid=125&print=false&Key=GQ6!WiJQSJrlrcVJSoeGhEQF7LVNhzfb0M!Nz!0SO5foSMK6!WiHQSJrlrcVJSoeGhEQ\">[PDF]</a>"                                                                                                                                                  
## [417] "<br><I>[Randomized algorithm using MCMC on very large option pricing trees"                                                                                                                                                                                                                                      
## [418] "where incomplete information about the value of an asset may be exploited to "                                                                                                                                                                                                                                   
## [419] "undertake fast and accurate pricing. Proof that a fully polynomial randomized "                                                                                                                                                                                                                                  
## [420] "approximation scheme (FPRAS) is available.]</I>"                                                                                                                                                                                                                                                                 
## [421] "</LI>"                                                                                                                                                                                                                                                                                                           
## [422] ""                                                                                                                                                                                                                                                                                                                
## [423] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [424] "\"Correlated Default Processes: A Criterion-Based Copula Approach,\""                                                                                                                                                                                                                                            
## [425] "(with Gary Geng), 2004, <I>Journal of Investment Management</I>, v2(2), 44-70,"                                                                                                                                                                                                                                  
## [426] "Special Issue on Default Risk. "                                                                                                                                                                                                                                                                                 
## [427] "<a href=\"https://www.joim.com/ArticleContainer.asp?artid=70&print=false&Key=GQ6!WiJQSJrlrcVJSoeGhEJF7LVNhzfb0M!Nz!0SO5foSMK6!WiHQSJrlrcVJSoeGhEJ\">[PDF]</a>"                                                                                                                                                   
## [428] "<br><I>[Which copula and marginal distributions best describe default probability"                                                                                                                                                                                                                               
## [429] "correlations? Develops models and methodology to answer this question. ]</I>"                                                                                                                                                                                                                                    
## [430] "</LI>"                                                                                                                                                                                                                                                                                                           
## [431] ""                                                                                                                                                                                                                                                                                                                
## [432] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [433] "\"Private Equity Returns: An Empirical Examination of the Exit of"                                                                                                                                                                                                                                               
## [434] "Venture-Backed Companies,\" (with Murali Jagannathan and Atulya Sarin),"                                                                                                                                                                                                                                         
## [435] "2003, <I>Journal of Investment Management</I>, v1(1), 152-177."                                                                                                                                                                                                                                                  
## [436] "<a href=\"Papers/PE_returns.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                     
## [437] "<br><I>[Gains from venture-backed investments depend upon the industry, the stage of the"                                                                                                                                                                                                                        
## [438] "firm being financed, the valuation at the time of financing, and the prevailing market"                                                                                                                                                                                                                          
## [439] "sentiment. Helps understand the risk premium required for the"                                                                                                                                                                                                                                                   
## [440] "valuation of private equity investments  ]</I>"                                                                                                                                                                                                                                                                  
## [441] "</LI>"                                                                                                                                                                                                                                                                                                           
## [442] ""                                                                                                                                                                                                                                                                                                                
## [443] "<LI><img src=\"graphics/IJISAFM_cover.gif\" width=\"40\" height=\"55\"> \"A"                                                                                                                                                                                                                                     
## [444] "Numerical Algorithm for Consumption/Investment Problems,\" (with Rangarajan"                                                                                                                                                                                                                                     
## [445] "Sundaram), 2002, <I>International Journal of Intelligent"                                                                                                                                                                                                                                                        
## [446] "Systems in Accounting, Finance and Management</I>, (Special"                                                                                                                                                                                                                                                     
## [447] "Issue on Computational Methods in Economics and Finance),  "                                                                                                                                                                                                                                                     
## [448] "December, 55-69."                                                                                                                                                                                                                                                                                                
## [449] "<a href=\"Papers/hjb.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                            
## [450] "<br><I>[A simple regression approach to solving optimal consumption"                                                                                                                                                                                                                                             
## [451] "and portfolio problems wit diffusions and jumps.]</I>"                                                                                                                                                                                                                                                           
## [452] "</LI>"                                                                                                                                                                                                                                                                                                           
## [453] ""                                                                                                                                                                                                                                                                                                                
## [454] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [455] "\"Bayesian Migration in Credit Ratings Based on Probabilities of"                                                                                                                                                                                                                                                
## [456] "Default,\" (with Rong Fan and Gary Geng), 2002, <I>Journal of"                                                                                                                                                                                                                                                   
## [457] "Fixed Income</I>, December, v12(3), 17-23.  "                                                                                                                                                                                                                                                                    
## [458] "<a href=\"Papers/ratingmigr.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                     
## [459] "<br><I>[Bayesian model for predicting rating changes based on the"                                                                                                                                                                                                                                               
## [460] "dynamics of default probabilities.]</I>"                                                                                                                                                                                                                                                                         
## [461] "</LI>"                                                                                                                                                                                                                                                                                                           
## [462] ""                                                                                                                                                                                                                                                                                                                
## [463] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [464] "\"The Impact of Correlated Default Risk on Credit Portfolios,\""                                                                                                                                                                                                                                                 
## [465] "(with Gifford Fong, and Gary Geng),"                                                                                                                                                                                                                                                                             
## [466] "2001, <i>Journal of Fixed Income</i>, v11(3), 9-19."                                                                                                                                                                                                                                                             
## [467] "<br><I>[The connection between credit portfolio loss distributions"                                                                                                                                                                                                                                              
## [468] "and credit correlations. ]</I>"                                                                                                                                                                                                                                                                                  
## [469] "</LI>"                                                                                                                                                                                                                                                                                                           
## [470] ""                                                                                                                                                                                                                                                                                                                
## [471] "<LI><img src=\"graphics/CIR_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [472] "\"How Diversified are Internationally Diversified Portfolios:"                                                                                                                                                                                                                                                   
## [473] "Time-Variation in the Covariances between International Returns,\""                                                                                                                                                                                                                                              
## [474] "1998, (with Raman Uppal), <I>Canadian Investment Review</I>, Spring, 7-11."                                                                                                                                                                                                                                      
## [475] "<a href=\"Papers/DasUppalCIR1998.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                
## [476] "<br><I>[Internation portfolio risk has systemic components.   ]</I>"                                                                                                                                                                                                                                             
## [477] "</LI>     "                                                                                                                                                                                                                                                                                                      
## [478] ""                                                                                                                                                                                                                                                                                                                
## [479] "<LI><img src=\"graphics/REDR_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [480] "\"Discrete-Time Bond and Option Pricing for Jump-Diffusion"                                                                                                                                                                                                                                                      
## [481] "Processes,\" 1997, <I>Review of Derivatives Research</I>, v1(3), 211-244. "                                                                                                                                                                                                                                      
## [482] "<br><I>[Extends the finite-differencing approach for interest rate derivatives"                                                                                                                                                                                                                                  
## [483] "to jump processes.]</I>"                                                                                                                                                                                                                                                                                         
## [484] "</LI>"                                                                                                                                                                                                                                                                                                           
## [485] ""                                                                                                                                                                                                                                                                                                                
## [486] "<LI><img src=\"graphics/AEL_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [487] "\"Macroeconomic Implications of Search Theory for the Labor Market,\""                                                                                                                                                                                                                                           
## [488] "1997, <I>Applied Economics Letters</I>, December, v4, 719-723."                                                                                                                                                                                                                                                  
## [489] "<br><I>[Connects option pricing theory to labor search theory. Calibrates to "                                                                                                                                                                                                                                   
## [490] "labor market data.]</I>"                                                                                                                                                                                                                                                                                         
## [491] "</LI>"                                                                                                                                                                                                                                                                                                           
## [492] ""                                                                                                                                                                                                                                                                                                                
## [493] "<LI> <img src=\"graphics/FMII_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                           
## [494] "\"Auction Theory: A Summary with Applications and Evidence"                                                                                                                                                                                                                                                      
## [495] "from the Treasury Markets,\" 1996, (with Rangarajan Sundaram),"                                                                                                                                                                                                                                                  
## [496] "<I>Financial Markets, Institutions and Instruments</I>, v5(5), 1-36."                                                                                                                                                                                                                                            
## [497] "<a href=\"Papers/DasSundaram_FMII1996_AuctionTheory.pdf\">[PDF]</a>"                                                                                                                                                                                                                                             
## [498] "<br><I>[A survey of models and literature on Treasury Auctions. ]</I>"                                                                                                                                                                                                                                           
## [499] "</LI>"                                                                                                                                                                                                                                                                                                           
## [500] ""                                                                                                                                                                                                                                                                                                                
## [501] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [502] "\"A Simple Approach to Three Factor Affine Models of the"                                                                                                                                                                                                                                                        
## [503] "Term Structure,\" (with Pierluigi Balduzzi, Silverio Foresi and Rangarajan"                                                                                                                                                                                                                                      
## [504] "Sundaram), 1996, <I>Journal of Fixed Income</I>, v6(3), 43-53."                                                                                                                                                                                                                                                  
## [505] "<br><I>[ An easy way to calibrate three factor models using method of moments.   ]</I>"                                                                                                                                                                                                                          
## [506] "</LI>"                                                                                                                                                                                                                                                                                                           
## [507] ""                                                                                                                                                                                                                                                                                                                
## [508] "<LI> <img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [509] "\"Analytical Approximations of  the Term Structure"                                                                                                                                                                                                                                                              
## [510] "for Jump-diffusion Processes: A Numerical Analysis,\" 1996, "                                                                                                                                                                                                                                                    
## [511] "(with Jamil Baz), <I>Journal of Fixed Income</I>, v6(1), 78-86. "                                                                                                                                                                                                                                                
## [512] "<br><I>[An exact solution to an approximate PDE may be better than "                                                                                                                                                                                                                                             
## [513] "an approximate solution to an exact PDDE for term structure models. ]</I>"                                                                                                                                                                                                                                       
## [514] "</LI>"                                                                                                                                                                                                                                                                                                           
## [515] ""                                                                                                                                                                                                                                                                                                                
## [516] "<LI> <img src=\"graphics/JAF_cover.jpg\" width=\"40\" height=\"55\"> \"Revisiting"                                                                                                                                                                                                                               
## [517] "Markov Chain Term Structure Models: Extensions and Applications,\""                                                                                                                                                                                                                                              
## [518] "1996, <I>Financial Practice and Education</I>, v6(1), 33-45. "                                                                                                                                                                                                                                                   
## [519] "<br><I>[A new pedagogy for Markov models of interest rates.  ]</I>"                                                                                                                                                                                                                                              
## [520] "</LI>"                                                                                                                                                                                                                                                                                                           
## [521] ""                                                                                                                                                                                                                                                                                                                
## [522] ""                                                                                                                                                                                                                                                                                                                
## [523] "<LI> <img src=\"graphics/REDR_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                           
## [524] "\"Exact Solutions for Bond and Options Prices"                                                                                                                                                                                                                                                                   
## [525] "with Systematic Jump Risk,\" 1996, (with Silverio Foresi),"                                                                                                                                                                                                                                                      
## [526] "<I>Review of Derivatives Research</I>, v1(1), 7-24. "                                                                                                                                                                                                                                                            
## [527] "<a href=\"Papers/DasForesiREDR1996.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                              
## [528] "<br><I>[First paper to show that affine solutions exist for "                                                                                                                                                                                                                                                    
## [529] "jump-diffusion term structure models.]</I>"                                                                                                                                                                                                                                                                      
## [530] "</LI>"                                                                                                                                                                                                                                                                                                           
## [531] ""                                                                                                                                                                                                                                                                                                                
## [532] "<LI> <img src=\"graphics/JOD_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [533] "\"Pricing Credit Sensitive Debt when Interest Rates, Credit Ratings"                                                                                                                                                                                                                                             
## [534] "and Credit Spreads are Stochastic,\" 1996, "                                                                                                                                                                                                                                                                     
## [535] "(with Peter Tufano), <I>The Journal of Financial Engineering</I>,"                                                                                                                                                                                                                                               
## [536] "v5(2), 161-198."                                                                                                                                                                                                                                                                                                 
## [537] "<a href=\"Papers/DasTufanoJFE1996.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                               
## [538] "<br><I>[Rating based model for credit derivatives with correlation between recovery "                                                                                                                                                                                                                            
## [539] "rates, interest rates and default probabilities. ]</I>"                                                                                                                                                                                                                                                          
## [540] "</LI>"                                                                                                                                                                                                                                                                                                           
## [541] ""                                                                                                                                                                                                                                                                                                                
## [542] "<LI> <img src=\"graphics/JOD_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [543] "\"Credit Risk Derivatives,\" <I>Journal of Derivatives</I>, 1995, pg 7-21. "                                                                                                                                                                                                                                     
## [544] "<a href=\"Papers/Das-JOD1995.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                    
## [545] "<br><I>[Introduces early models for pricing credit derivatives as compound options.  ]</I>"                                                                                                                                                                                                                      
## [546] "</LI>"                                                                                                                                                                                                                                                                                                           
## [547] ""                                                                                                                                                                                                                                                                                                                
## [548] ""                                                                                                                                                                                                                                                                                                                
## [549] ""                                                                                                                                                                                                                                                                                                                
## [550] ""                                                                                                                                                                                                                                                                                                                
## [551] ""                                                                                                                                                                                                                                                                                                                
## [552] "<H2>SHORTER ARTICLES and BOOK CHAPTERS (Mostly Non-refereed)</H2>"                                                                                                                                                                                                                                               
## [553] ""                                                                                                                                                                                                                                                                                                                
## [554] "<LI><img src=\"graphics/jwm.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                                   
## [555] "\"Portfolios for Investors Who Want to Reach Their Goals While Staying on the Mean-Variance Efficient Frontier,\" (2011), "                                                                                                                                                                                      
## [556] "(with Harry Markowitz, Jonathan Scheid, and Meir Statman), "                                                                                                                                                                                                                                                     
## [557] "<I>Journal of Wealth Management</I>, Fall, 14(2), 25-31."                                                                                                                                                                                                                                                        
## [558] "<br><I>[A framework for goal driven mental accounting and behavioral portfolio allocation that extends mean-variance portfolios.]</I>"                                                                                                                                                                           
## [559] "</LI> "                                                                                                                                                                                                                                                                                                          
## [560] ""                                                                                                                                                                                                                                                                                                                
## [561] "<LI><img src=\"graphics/HNAF_Wiley.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [562] "\"News Analytics: Framework, Techniques and Metrics,\" The Handbook of News Analytics in Finance, May 2011, John Wiley & Sons, U.K. "                                                                                                                                                                            
## [563] "<a href=\"Papers/newsmetrics.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                    
## [564] "</LI>"                                                                                                                                                                                                                                                                                                           
## [565] ""                                                                                                                                                                                                                                                                                                                
## [566] ""                                                                                                                                                                                                                                                                                                                
## [567] ""                                                                                                                                                                                                                                                                                                                
## [568] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [569] "\"Random Lattices for Option Pricing Problems in Finance,\" (2011),"                                                                                                                                                                                                                                             
## [570] "<I>Journal of Investment Management</I>, 9(2), 88-106."                                                                                                                                                                                                                                                          
## [571] "<a href=\"Papers/randlatt.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                       
## [572] "</LI>"                                                                                                                                                                                                                                                                                                           
## [573] ""                                                                                                                                                                                                                                                                                                                
## [574] ""                                                                                                                                                                                                                                                                                                                
## [575] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [576] "\"Implementing Option Pricing Models using Python and Cython,\" (2010),"                                                                                                                                                                                                                                         
## [577] "(with Brian Granger), <I>Journal of Investment Management</I>, 9(4), 72-84"                                                                                                                                                                                                                                      
## [578] "<a href=\"Papers/cython.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [579] "</LI>"                                                                                                                                                                                                                                                                                                           
## [580] ""                                                                                                                                                                                                                                                                                                                
## [581] ""                                                                                                                                                                                                                                                                                                                
## [582] ""                                                                                                                                                                                                                                                                                                                
## [583] "<LI><img src=\"graphics/IEEE_IS_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                         
## [584] "\"The Finance Web: Internet Information and Markets,\" (2010), "                                                                                                                                                                                                                                                 
## [585] "<I>IEEE Intelligent Systems</I>, 25(2), Mar/Apr, 74--78. "                                                                                                                                                                                                                                                       
## [586] "</LI>"                                                                                                                                                                                                                                                                                                           
## [587] ""                                                                                                                                                                                                                                                                                                                
## [588] ""                                                                                                                                                                                                                                                                                                                
## [589] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [590] "\"Financial Applications with Parallel R,\" (2009), "                                                                                                                                                                                                                                                            
## [591] "(with Brian Granger), <I>Journal of Investment Management</I>, 7(4), 66-77"                                                                                                                                                                                                                                      
## [592] "<a href=\"Papers/parallelr_options.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                              
## [593] "</LI>"                                                                                                                                                                                                                                                                                                           
## [594] ""                                                                                                                                                                                                                                                                                                                
## [595] ""                                                                                                                                                                                                                                                                                                                
## [596] "<LI><img src=\"graphics/EQF.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                                   
## [597] "\"Recovery Swaps,\" (2009), (with Paul Hanouna),  "                                                                                                                                                                                                                                                              
## [598] "<I>Encyclopedia of Quantitative Finance</I>, John Wiley and Sons, U.K., 1507--1509 "                                                                                                                                                                                                                             
## [599] ""                                                                                                                                                                                                                                                                                                                
## [600] "<LI><img src=\"graphics/EQF.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                                   
## [601] "\"Recovery Rates,\" (2009),(with Paul Hanouna), "                                                                                                                                                                                                                                                                
## [602] "<I>Encyclopedia of Quantitative Finance</I>, John Wiley and Sons, U.K., 1505--1507"                                                                                                                                                                                                                              
## [603] ""                                                                                                                                                                                                                                                                                                                
## [604] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [605] "``A Simple Model for Pricing Securities with a Debt-Equity Linkage,'' 2008, in "                                                                                                                                                                                                                                 
## [606] "<I> Innovations in Investment Management</I>, Bloomberg Press, 85-112."                                                                                                                                                                                                                                          
## [607] ""                                                                                                                                                                                                                                                                                                                
## [608] ""                                                                                                                                                                                                                                                                                                                
## [609] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [610] "\"Credit Default Swap Spreads\", 2006, (with Paul Hanouna), "                                                                                                                                                                                                                                                    
## [611] "<I>Journal of Investment Management</I>, v4(3), 93-105."                                                                                                                                                                                                                                                         
## [612] "</LI>"                                                                                                                                                                                                                                                                                                           
## [613] ""                                                                                                                                                                                                                                                                                                                
## [614] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [615] "\"Multiple-Core Processors for Finance Applications,\" 2006, "                                                                                                                                                                                                                                                   
## [616] "<I>Journal of Investment Management</I>, v4(2), 76-81."                                                                                                                                                                                                                                                          
## [617] "</LI>"                                                                                                                                                                                                                                                                                                           
## [618] ""                                                                                                                                                                                                                                                                                                                
## [619] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [620] "\"Power Laws,\" 2005, (with Jacob Sisk), "                                                                                                                                                                                                                                                                       
## [621] "<I>Journal of Investment Management</I>, v3(3), 84-91."                                                                                                                                                                                                                                                          
## [622] "<a href=\"https://www.joim.com/ArticleContainer.asp?artID=154\">[PDF]</a>"                                                                                                                                                                                                                                       
## [623] "</LI>"                                                                                                                                                                                                                                                                                                           
## [624] ""                                                                                                                                                                                                                                                                                                                
## [625] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [626] "\"Genetic Algorithms,\" 2005,"                                                                                                                                                                                                                                                                                   
## [627] "<I>Journal of Investment Management</I>, v3(2), 77-82."                                                                                                                                                                                                                                                          
## [628] "</LI>"                                                                                                                                                                                                                                                                                                           
## [629] ""                                                                                                                                                                                                                                                                                                                
## [630] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [631] "\"Recovery Risk,\" 2005,"                                                                                                                                                                                                                                                                                        
## [632] "<I>Journal of Investment Management</I>, v3(1), 113-120."                                                                                                                                                                                                                                                        
## [633] "</LI>"                                                                                                                                                                                                                                                                                                           
## [634] ""                                                                                                                                                                                                                                                                                                                
## [635] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [636] "\"Venture Capital Syndication\", (with Hoje Jo and Yongtae Kim), 2004"                                                                                                                                                                                                                                           
## [637] "<I>Journal of Investment Management</I>, v2(4), 132-143."                                                                                                                                                                                                                                                        
## [638] "</LI>"                                                                                                                                                                                                                                                                                                           
## [639] ""                                                                                                                                                                                                                                                                                                                
## [640] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [641] "\"Technical Analysis\", (with David Tien), 2004"                                                                                                                                                                                                                                                                 
## [642] "<I>Journal of Investment Management</I>, v2(1), 79-85."                                                                                                                                                                                                                                                          
## [643] "</LI>"                                                                                                                                                                                                                                                                                                           
## [644] ""                                                                                                                                                                                                                                                                                                                
## [645] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [646] "\"Liquidity and the Bond Markets, (with Jan Ericsson and "                                                                                                                                                                                                                                                       
## [647] "Madhu Kalimipalli), 2003,"                                                                                                                                                                                                                                                                                       
## [648] "<I>Journal of Investment Management</I>, v1(4), 95-103."                                                                                                                                                                                                                                                         
## [649] "</LI>"                                                                                                                                                                                                                                                                                                           
## [650] ""                                                                                                                                                                                                                                                                                                                
## [651] "<LI><img src=\"graphics/JEL_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [652] "\"Modern Pricing of Interest Rate Derivatives - Book Review\", "                                                                                                                                                                                                                                                 
## [653] "2004, <I>Journal of Economic Literature</I>, vXLII, 528-529."                                                                                                                                                                                                                                                    
## [654] ""                                                                                                                                                                                                                                                                                                                
## [655] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [656] "\"Contagion\", 2003,"                                                                                                                                                                                                                                                                                            
## [657] "<I>Journal of Investment Management</I>, v1(3), 78-84."                                                                                                                                                                                                                                                          
## [658] "</LI>"                                                                                                                                                                                                                                                                                                           
## [659] ""                                                                                                                                                                                                                                                                                                                
## [660] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [661] "\"Hedge Funds\", 2003,"                                                                                                                                                                                                                                                                                          
## [662] "<I>Journal of Investment Management</I>, v1(2), 76-81."                                                                                                                                                                                                                                                          
## [663] "Reprinted in "                                                                                                                                                                                                                                                                                                   
## [664] "\"Working Papers on Hedge Funds,\" in The World of Hedge Funds: "                                                                                                                                                                                                                                                
## [665] "Characteristics and "                                                                                                                                                                                                                                                                                            
## [666] "Analysis, 2005, World Scientific."                                                                                                                                                                                                                                                                               
## [667] "</LI>"                                                                                                                                                                                                                                                                                                           
## [668] ""                                                                                                                                                                                                                                                                                                                
## [669] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [670] "\"The Internet and Investors\", 2003,"                                                                                                                                                                                                                                                                           
## [671] "<I>Journal of Investment Management</I>, v1(1), 213-217."                                                                                                                                                                                                                                                        
## [672] "</LI>"                                                                                                                                                                                                                                                                                                           
## [673] ""                                                                                                                                                                                                                                                                                                                
## [674] "<LI><img src=\"graphics/EC_cover.gif\">"                                                                                                                                                                                                                                                                         
## [675] "  \"Useful things to know about Correlated Default Risk,\""                                                                                                                                                                                                                                                      
## [676] "(with Gifford Fong, Laurence Freed, Gary Geng, and Nikunj Kapadia),"                                                                                                                                                                                                                                             
## [677] "2001,&nbsp; <i>Extra Credit</i>, November-December, 14-23."                                                                                                                                                                                                                                                      
## [678] "</LI>"                                                                                                                                                                                                                                                                                                           
## [679] ""                                                                                                                                                                                                                                                                                                                
## [680] "<LI><img src=\"graphics/QAFM_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [681] "\"The Regulation of Fee Structures in Mutual Funds: A Theoretical Analysis,'' "                                                                                                                                                                                                                                  
## [682] "(with Rangarajan Sundaram), 1998, NBER WP No 6639, in the"                                                                                                                                                                                                                                                       
## [683] "Courant Institute of Mathematical Sciences, special volume on"                                                                                                                                                                                                                                                   
## [684] "<I>Quantitative Analysis in Financial Markets</I>, Volume III, 2001."                                                                                                                                                                                                                                            
## [685] "</LI>"                                                                                                                                                                                                                                                                                                           
## [686] ""                                                                                                                                                                                                                                                                                                                
## [687] "<LI><img src=\"graphics/QAFM_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [688] "\"A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "                                                                                                                                                                                                                                  
## [689] "(with Rangarajan Sundaram), reprinted in "                                                                                                                                                                                                                                                                       
## [690] "the Courant Institute of Mathematical Sciences, special volume on"                                                                                                                                                                                                                                               
## [691] "<I>Quantitative Analysis in Financial Markets</I>, Volume III, 2001."                                                                                                                                                                                                                                            
## [692] "</LI>"                                                                                                                                                                                                                                                                                                           
## [693] ""                                                                                                                                                                                                                                                                                                                
## [694] "<LI><img src=\"graphics/AFIVT_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                           
## [695] "\"Stochastic Mean Models of the Term Structure,''"                                                                                                                                                                                                                                                               
## [696] "(with Pierluigi Balduzzi, Silverio Foresi and Rangarajan Sundaram), "                                                                                                                                                                                                                                            
## [697] "2000, <I>Advanced Fixed-Income Valuation Tools"                                                                                                                                                                                                                                                                  
## [698] "</I>, edited by N. Jegadeesh and B. Tuckman,"                                                                                                                                                                                                                                                                    
## [699] "John Wiley & Sons, Inc., 128-161."                                                                                                                                                                                                                                                                               
## [700] "</LI>"                                                                                                                                                                                                                                                                                                           
## [701] ""                                                                                                                                                                                                                                                                                                                
## [702] "<LI><img src=\"graphics/AFIVT_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                           
## [703] "\"Interest Rate Modeling with Jump-Diffusion Processes,'' "                                                                                                                                                                                                                                                      
## [704] "2000, <I>Advanced Fixed-Income Valuation Tools"                                                                                                                                                                                                                                                                  
## [705] "</I>, edited by N. Jegadeesh and B. Tuckman,"                                                                                                                                                                                                                                                                    
## [706] "John Wiley & Sons, Inc., 162-189."                                                                                                                                                                                                                                                                               
## [707] "</LI>"                                                                                                                                                                                                                                                                                                           
## [708] ""                                                                                                                                                                                                                                                                                                                
## [709] "<LI><img src=\"graphics/FCR_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [710] "Comments on 'Pricing Excess-of-Loss Reinsurance Contracts against"                                                                                                                                                                                                                                               
## [711] "Catastrophic Loss,' by J. David Cummins, C. Lewis, and Richard Phillips,"                                                                                                                                                                                                                                        
## [712] "in <I>The Financing of Catastrophe Risk</I>, Kenneth A"                                                                                                                                                                                                                                                          
## [713] "Froot (Ed.), University of Chicago Press, 1999, 141-145."                                                                                                                                                                                                                                                        
## [714] "</LI>"                                                                                                                                                                                                                                                                                                           
## [715] ""                                                                                                                                                                                                                                                                                                                
## [716] "<LI><img src=\"graphics/HCD_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [717] "  \"Pricing Credit Derivatives,'' "                                                                                                                                                                                                                                                                              
## [718] "1999, <I>Handbook of Credit Derivatives</I>, eds J. Francis,"                                                                                                                                                                                                                                                    
## [719] "J. Frost and J.G. Whittaker, 101-138."                                                                                                                                                                                                                                                                           
## [720] "</LI>"                                                                                                                                                                                                                                                                                                           
## [721] ""                                                                                                                                                                                                                                                                                                                
## [722] "<LI><img src=\"graphics/PEC_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [723] "\"On the Recursive Implementation of Term Structure Models,'' "                                                                                                                                                                                                                                                  
## [724] "1998, <I>Pecunia</I>, The Netherlands, Summer 1998, 45-49."                                                                                                                                                                                                                                                      
## [725] "</LI>"                                                                                                                                                                                                                                                                                                           
## [726] ""                                                                                                                                                                                                                                                                                                                
## [727] ""                                                                                                                                                                                                                                                                                                                
## [728] "</OL>"                                                                                                                                                                                                                                                                                                           
## [729] ""                                                                                                                                                                                                                                                                                                                
## [730] ""                                                                                                                                                                                                                                                                                                                
## [731] "<H2>WORKING PAPERS</H2>"                                                                                                                                                                                                                                                                                         
## [732] ""                                                                                                                                                                                                                                                                                                                
## [733] "<OL>"                                                                                                                                                                                                                                                                                                            
## [734] ""                                                                                                                                                                                                                                                                                                                
## [735] "<LI><img src=\"graphics/frog2.gif\">"                                                                                                                                                                                                                                                                            
## [736] "\"Efficient Rebalancing of Taxable Portfolios\" (with Dan Ostrov, Dennis Ding, Vincent Newell), "                                                                                                                                                                                                                
## [737] "<a href=\"Papers/taxopt.pdf\">[PDF]</a>. "                                                                                                                                                                                                                                                                       
## [738] "<a href=\"Papers/taxopt_slides_RFinance_2015_05.pdf\">[SLIDES RFinance]</a>. "                                                                                                                                                                                                                                   
## [739] "<a href=\"Papers/taxopt_slides2.pdf\">[SLIDES JOIM]</a>. "                                                                                                                                                                                                                                                       
## [740] ""                                                                                                                                                                                                                                                                                                                
## [741] "<LI><img src=\"graphics/frog2.gif\">"                                                                                                                                                                                                                                                                            
## [742] "\"Rollover Risk and Capital Structure Covenants in Structured Finance Vehicles\","                                                                                                                                                                                                                               
## [743] " (with Seoyoung Kim), "                                                                                                                                                                                                                                                                                          
## [744] "<a href=\"Papers/siv2.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                           
## [745] ""                                                                                                                                                                                                                                                                                                                
## [746] ""                                                                                                                                                                                                                                                                                                                
## [747] "<LI><img src=\"graphics/frog2.gif\">"                                                                                                                                                                                                                                                                            
## [748] "\"Liability Directed Investing in a Behavioral Portfolio Theory Framework"                                                                                                                                                                                                                                       
## [749] " (with Seoyoung Kim and Meir Statman), "                                                                                                                                                                                                                                                                         
## [750] "<a href=\"Papers/ldibpt.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [751] ""                                                                                                                                                                                                                                                                                                                
## [752] ""                                                                                                                                                                                                                                                                                                                
## [753] "<LI><img src=\"graphics/frog2.gif\">"                                                                                                                                                                                                                                                                            
## [754] "\"The Fast and the Curious: VC Drift\" "                                                                                                                                                                                                                                                                         
## [755] "(with Amit Bubna and Paul Hanouna), "                                                                                                                                                                                                                                                                            
## [756] "<a href=\"Papers/vcstyle.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                        
## [757] ""                                                                                                                                                                                                                                                                                                                
## [758] ""                                                                                                                                                                                                                                                                                                                
## [759] "<LI><img src=\"graphics/frog2.gif\">"                                                                                                                                                                                                                                                                            
## [760] "\"Venture Capital Communities\" (with Amit Bubna and Nagpurnanand Prabhala), "                                                                                                                                                                                                                                   
## [761] "<a href=\"Papers/vccomm.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [762] ""                                                                                                                                                                                                                                                                                                                
## [763] "<LI><img src=\"graphics/frog2.gif\">"                                                                                                                                                                                                                                                                            
## [764] "\"The Design and Risk Management of Structured Finance Vehicles,\" (with Seoyoung Kim), <a href=\"Papers/siv.pdf\">[PDF]</a>"                                                                                                                                                                                    
## [765] ""                                                                                                                                                                                                                                                                                                                
## [766] ""                                                                                                                                                                                                                                                                                                                
## [767] ""                                                                                                                                                                                                                                                                                                                
## [768] ""                                                                                                                                                                                                                                                                                                                
## [769] "</OL>"                                                                                                                                                                                                                                                                                                           
## [770] ""                                                                                                                                                                                                                                                                                                                
## [771] ""                                                                                                                                                                                                                                                                                                                
## [772] ""                                                                                                                                                                                                                                                                                                                
## [773] ""                                                                                                                                                                                                                                                                                                                
## [774] ""                                                                                                                                                                                                                                                                                                                
## [775] ""                                                                                                                                                                                                                                                                                                                
## [776] ""                                                                                                                                                                                                                                                                                                                
## [777] ""                                                                                                                                                                                                                                                                                                                
## [778] "</UL>"                                                                                                                                                                                                                                                                                                           
## [779] "<p>"                                                                                                                                                                                                                                                                                                             
## [780] "My page on SSRN (with downloadable papers) is <a"                                                                                                                                                                                                                                                                
## [781] "href=\"http://ssrn.com/author=17108\">here</a>."                                                                                                                                                                                                                                                                 
## [782] ""                                                                                                                                                                                                                                                                                                                
## [783] ""                                                                                                                                                                                                                                                                                                                
## [784] ""                                                                                                                                                                                                                                                                                                                
## [785] "                                                "                                                                                                                                                                                                                                                                
## [786] ""                                                                                                                                                                                                                                                                                                                
## [787] ""                                                                                                                                                                                                                                                                                                                
## [788] ""                                                                                                                                                                                                                                                                                                                
## [789] "</BODY>"                                                                                                                                                                                                                                                                                                         
## [790] ""                                                                                                                                                                                                                                                                                                                
## [791] "</HTML>"                                                                                                                                                                                                                                                                                                         
## [792] ""                                                                                                                                                                                                                                                                                                                
## [793] ""                                                                                                                                                                                                                                                                                                                
## [794] ""
# Drop every line that contains markup residue: tags, brackets, braces, underscores, or slashes.
# fixed = TRUE treats each pattern as a literal character, so no regex escaping is needed.
for (pat in c("<", ">", "]", "}", "_", "/")) {
  text = text[!grepl(pat, text, fixed = TRUE)]
}
print(length(text))
## [1] 324
print(text)
##   [1] ""                                                                                                                                    
##   [2] ""                                                                                                                                    
##   [3] ""                                                                                                                                    
##   [4] "\"Data Science: Theories, Models, Algorithms, and Analytics\" (web book -- work in progress)"                                        
##   [5] ""                                                                                                                                    
##   [6] ""                                                                                                                                    
##   [7] "\"Derivatives: Principles and Practice\" (2010),"                                                                                    
##   [8] "(Rangarajan Sundaram and Sanjiv Das), McGraw Hill."                                                                                  
##   [9] ""                                                                                                                                    
##  [10] ""                                                                                                                                    
##  [11] ""                                                                                                                                    
##  [12] ""                                                                                                                                    
##  [13] "\"An Index-Based Measure of Liquidity,'' (with George Chacko and Rong Fan), (2016)."                                                 
##  [14] ""                                                                                                                                    
##  [15] "\"Matrix Metrics: Network-Based Systemic Risk Scoring\", (2016)."                                                                    
##  [16] "of systemic risk. This paper won the First Prize in the MIT-CFP competition 2016 for "                                               
##  [17] "the best paper on SIFIs (systemically important financial institutions). "                                                           
##  [18] "It also won the best paper award at "                                                                                                
##  [19] ""                                                                                                                                    
##  [20] ""                                                                                                                                    
##  [21] ""                                                                                                                                    
##  [22] ""                                                                                                                                    
##  [23] "\"Credit Spreads with Dynamic Debt\" (with Seoyoung Kim), (2015), "                                                                  
##  [24] ""                                                                                                                                    
##  [25] "\"Text and Context: Language Analytics for Finance\", (2014),"                                                                       
##  [26] ""                                                                                                                                    
##  [27] ""                                                                                                                                    
##  [28] ""                                                                                                                                    
##  [29] "\"Strategic Loan Modification: An Options-Based Response to Strategic Default,\""                                                    
##  [30] ""                                                                                                                                    
##  [31] ""                                                                                                                                    
##  [32] "\"Options and Structured Products in Behavioral Portfolios,\" (with Meir Statman), (2013), "                                         
##  [33] "and barrier range notes, in the presence of fat-tailed outcomes using copulas."                                                      
##  [34] ""                                                                                                                                    
##  [35] ""                                                                                                                                    
##  [36] ""                                                                                                                                    
##  [37] "\"Polishing Diamonds in the Rough: The Sources of Syndicated Venture Performance,\" (2011), (with Hoje Jo and Yongtae Kim), "        
##  [38] ""                                                                                                                                    
##  [39] "Optimization with Mental Accounts,\" (2010), (with Harry Markowitz, Jonathan"                                                        
##  [40] ""                                                                                                                                    
##  [41] ""                                                                                                                                    
##  [42] ""                                                                                                                                    
##  [43] ""                                                                                                                                    
##  [44] ""                                                                                                                                    
##  [45] ""                                                                                                                                    
##  [46] ""                                                                                                                                    
##  [47] ""                                                                                                                                    
##  [48] "\"Accounting-based versus market-based cross-sectional models of CDS spreads,\" "                                                    
##  [49] "(with Paul Hanouna and Atulya Sarin), (2009), "                                                                                      
##  [50] ""                                                                                                                                    
##  [51] ""                                                                                                                                    
##  [52] "\"Hedging Credit: Equity Liquidity Matters,\" (with Paul Hanouna), (2009),"                                                          
##  [53] ""                                                                                                                                    
##  [54] "\"An Integrated Model for Hybrid Securities,\""                                                                                      
##  [55] ""                                                                                                                                    
##  [56] "\"Yahoo for Amazon! Sentiment Extraction from Small Talk on the Web,\""                                                              
##  [57] ""                                                                                                                                    
##  [58] "\"Common Failings: How Corporate Defaults are Correlated\" "                                                                         
##  [59] "(with Darrell Duffie, Nikunj Kapadia and Leandro Saita)."                                                                            
##  [60] ""                                                                                                                                    
##  [61] "\"A Clinical Study of Investor Discussion and Sentiment,\" "                                                                         
##  [62] "(with Asis Martinez-Jerez and Peter Tufano), 2005, "                                                                                 
##  [63] ""                                                                                                                                    
##  [64] "\"International Portfolio Choice with Systemic Risk,\""                                                                              
##  [65] "The loss resulting from diminished diversification is small, while"                                                                  
##  [66] ""                                                                                                                                    
##  [67] "Speech: Signaling, Risk-sharing and the Impact of Fee Structures on"                                                                 
##  [68] "investor welfare. Contrary to regulatory intuition, incentive structures"                                                            
##  [69] ""                                                                                                                                    
##  [70] "\"A Discrete-Time Approach to No-arbitrage Pricing of Credit derivatives"                                                            
##  [71] "with Rating Transitions,\" (with Viral Acharya and Rangarajan Sundaram),"                                                            
##  [72] ""                                                                                                                                    
##  [73] ""                                                                                                                                    
##  [74] "\"Pricing Interest Rate Derivatives: A General Approach,''(with George Chacko),"                                                     
##  [75] ""                                                                                                                                    
##  [76] "\"A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "                                                      
##  [77] ""                                                                                                                                    
##  [78] "\"The Psychology of Financial Decision Making: A Case"                                                                               
##  [79] "for Theory-Driven Experimental Enquiry,''"                                                                                           
##  [80] "1999, (with Priya Raghubir),"                                                                                                        
##  [81] ""                                                                                                                                    
##  [82] "\"Of Smiles and Smirks: A Term Structure Perspective,''"                                                                             
##  [83] ""                                                                                                                                    
##  [84] "\"A Theory of Banking Structure,\" 1999, (with Ashish Nanda),"                                                                       
##  [85] "by function based upon two dimensions: the degree of information asymmetry "                                                         
##  [86] ""                                                                                                                                    
##  [87] "\"A Theory of Optimal Timing and Selectivity,'' "                                                                                    
##  [88] ""                                                                                                                                    
##  [89] "\"A Direct Discrete-Time Approach to"                                                                                                
##  [90] "Poisson-Gaussian Bond Option Pricing in the Heath-Jarrow-Morton "                                                                    
##  [91] ""                                                                                                                                    
##  [92] "\"The Central Tendency: A Second Factor in"                                                                                          
##  [93] "Bond Yields,\" 1998, (with Silverio Foresi and Pierluigi Balduzzi),  "                                                               
##  [94] ""                                                                                                                                    
##  [95] "\"Efficiency with Costly Information: A Reinterpretation of"                                                                         
##  [96] "Evidence from Managed Portfolios,\" (with Edwin Elton, Martin Gruber and Matt "                                                      
##  [97] "Presented and Reprinted in the Proceedings of The "                                                                                  
##  [98] "Seminar on the Analysis of Security Prices at the Center "                                                                           
##  [99] "for Research in Security   Prices  at the University of "                                                                            
## [100] ""                                                                                                                                    
## [101] ""                                                                                                                                    
## [102] ""                                                                                                                                    
## [103] ""                                                                                                                                    
## [104] ""                                                                                                                                    
## [105] ""                                                                                                                                    
## [106] ""                                                                                                                                    
## [107] "\"Coming up Short: Managing Underfunded Portfolios in an LDI-ES Framework\" (2014), "                                                
## [108] "(with Seoyoung Kim and Meir Statman),  "                                                                                             
## [109] ""                                                                                                                                    
## [110] ""                                                                                                                                    
## [111] "\"Going for Broke: Restructuring Distressed Debt Portfolios\" (2014),"                                                               
## [112] ""                                                                                                                                    
## [113] ""                                                                                                                                    
## [114] "\"Digital Portfolios.\" (2013), "                                                                                                    
## [115] ""                                                                                                                                    
## [116] ""                                                                                                                                    
## [117] "\"Options on Portfolios with Higher-Order Moments,\" (2009),"                                                                        
## [118] "options on a multivariate system of assets, calibrated to the return "                                                               
## [119] ""                                                                                                                                    
## [120] ""                                                                                                                                    
## [121] "\"Dealing with Dimension: Option Pricing on Factor Trees,\" (2009),"                                                                 
## [122] "you to price options on multiple assets in a unified fraamework. Computational"                                                      
## [123] ""                                                                                                                                    
## [124] ""                                                                                                                                    
## [125] "\"Modeling"                                                                                                                          
## [126] "Correlated Default with a Forest of Binomial Trees,\" (2007), (with"                                                                 
## [127] ""                                                                                                                                    
## [128] "\"Basel II: Correlation Related Issues\" (2007), "                                                                                   
## [129] ""                                                                                                                                    
## [130] "\"Correlated Default Risk,\" (2006),"                                                                                                
## [131] "(with Laurence Freed, Gary Geng, and Nikunj Kapadia),"                                                                               
## [132] "increase as markets worsen. Regime switching models are needed to explain dynamic"                                                   
## [133] ""                                                                                                                                    
## [134] "\"A Simple Model for Pricing Equity Options with Markov"                                                                             
## [135] "Switching State Variables\" (2006),"                                                                                                 
## [136] "(with Donald Aingworth and Rajeev Motwani),"                                                                                         
## [137] ""                                                                                                                                    
## [138] "\"The Firm's Management of Social Interactions,\" (2005)"                                                                            
## [139] "(with D. Godes, D. Mayzlin, Y. Chen, S. Das, C. Dellarocas, "                                                                        
## [140] "B. Pfeieffer, B. Libai, S. Sen, M. Shi, and P. Verlegh). "                                                                           
## [141] ""                                                                                                                                    
## [142] "\"Financial Communities\" (with Jacob Sisk), 2005, "                                                                                 
## [143] "Summer, 112-123."                                                                                                                    
## [144] ""                                                                                                                                    
## [145] "\"Monte Carlo Markov Chain Methods for Derivative Pricing"                                                                           
## [146] "and Risk Assessment,\"(with Alistair Sinclair), 2005, "                                                                              
## [147] "where incomplete information about the value of an asset may be exploited to "                                                       
## [148] "undertake fast and accurate pricing. Proof that a fully polynomial randomized "                                                      
## [149] ""                                                                                                                                    
## [150] "\"Correlated Default Processes: A Criterion-Based Copula Approach,\""                                                                
## [151] "Special Issue on Default Risk. "                                                                                                     
## [152] ""                                                                                                                                    
## [153] "\"Private Equity Returns: An Empirical Examination of the Exit of"                                                                   
## [154] "Venture-Backed Companies,\" (with Murali Jagannathan and Atulya Sarin),"                                                             
## [155] "firm being financed, the valuation at the time of financing, and the prevailing market"                                              
## [156] "sentiment. Helps understand the risk premium required for the"                                                                       
## [157] ""                                                                                                                                    
## [158] "Issue on Computational Methods in Economics and Finance),  "                                                                         
## [159] "December, 55-69."                                                                                                                    
## [160] ""                                                                                                                                    
## [161] "\"Bayesian Migration in Credit Ratings Based on Probabilities of"                                                                    
## [162] ""                                                                                                                                    
## [163] "\"The Impact of Correlated Default Risk on Credit Portfolios,\""                                                                     
## [164] "(with Gifford Fong, and Gary Geng),"                                                                                                 
## [165] ""                                                                                                                                    
## [166] "\"How Diversified are Internationally Diversified Portfolios:"                                                                       
## [167] "Time-Variation in the Covariances between International Returns,\""                                                                  
## [168] ""                                                                                                                                    
## [169] "\"Discrete-Time Bond and Option Pricing for Jump-Diffusion"                                                                          
## [170] ""                                                                                                                                    
## [171] "\"Macroeconomic Implications of Search Theory for the Labor Market,\""                                                               
## [172] ""                                                                                                                                    
## [173] "\"Auction Theory: A Summary with Applications and Evidence"                                                                          
## [174] "from the Treasury Markets,\" 1996, (with Rangarajan Sundaram),"                                                                      
## [175] ""                                                                                                                                    
## [176] "\"A Simple Approach to Three Factor Affine Models of the"                                                                            
## [177] "Term Structure,\" (with Pierluigi Balduzzi, Silverio Foresi and Rangarajan"                                                          
## [178] ""                                                                                                                                    
## [179] "\"Analytical Approximations of  the Term Structure"                                                                                  
## [180] "for Jump-diffusion Processes: A Numerical Analysis,\" 1996, "                                                                        
## [181] ""                                                                                                                                    
## [182] "Markov Chain Term Structure Models: Extensions and Applications,\""                                                                  
## [183] ""                                                                                                                                    
## [184] ""                                                                                                                                    
## [185] "\"Exact Solutions for Bond and Options Prices"                                                                                       
## [186] "with Systematic Jump Risk,\" 1996, (with Silverio Foresi),"                                                                          
## [187] ""                                                                                                                                    
## [188] "\"Pricing Credit Sensitive Debt when Interest Rates, Credit Ratings"                                                                 
## [189] "and Credit Spreads are Stochastic,\" 1996, "                                                                                         
## [190] "v5(2), 161-198."                                                                                                                     
## [191] ""                                                                                                                                    
## [192] ""                                                                                                                                    
## [193] ""                                                                                                                                    
## [194] ""                                                                                                                                    
## [195] ""                                                                                                                                    
## [196] ""                                                                                                                                    
## [197] ""                                                                                                                                    
## [198] "\"Portfolios for Investors Who Want to Reach Their Goals While Staying on the Mean-Variance Efficient Frontier,\" (2011), "          
## [199] "(with Harry Markowitz, Jonathan Scheid, and Meir Statman), "                                                                         
## [200] ""                                                                                                                                    
## [201] "\"News Analytics: Framework, Techniques and Metrics,\" The Handbook of News Analytics in Finance, May 2011, John Wiley & Sons, U.K. "
## [202] ""                                                                                                                                    
## [203] ""                                                                                                                                    
## [204] ""                                                                                                                                    
## [205] "\"Random Lattices for Option Pricing Problems in Finance,\" (2011),"                                                                 
## [206] ""                                                                                                                                    
## [207] ""                                                                                                                                    
## [208] "\"Implementing Option Pricing Models using Python and Cython,\" (2010),"                                                             
## [209] ""                                                                                                                                    
## [210] ""                                                                                                                                    
## [211] ""                                                                                                                                    
## [212] "\"The Finance Web: Internet Information and Markets,\" (2010), "                                                                     
## [213] ""                                                                                                                                    
## [214] ""                                                                                                                                    
## [215] "\"Financial Applications with Parallel R,\" (2009), "                                                                                
## [216] ""                                                                                                                                    
## [217] ""                                                                                                                                    
## [218] "\"Recovery Swaps,\" (2009), (with Paul Hanouna),  "                                                                                  
## [219] ""                                                                                                                                    
## [220] "\"Recovery Rates,\" (2009),(with Paul Hanouna), "                                                                                    
## [221] ""                                                                                                                                    
## [222] "``A Simple Model for Pricing Securities with a Debt-Equity Linkage,'' 2008, in "                                                     
## [223] ""                                                                                                                                    
## [224] ""                                                                                                                                    
## [225] "\"Credit Default Swap Spreads\", 2006, (with Paul Hanouna), "                                                                        
## [226] ""                                                                                                                                    
## [227] "\"Multiple-Core Processors for Finance Applications,\" 2006, "                                                                       
## [228] ""                                                                                                                                    
## [229] "\"Power Laws,\" 2005, (with Jacob Sisk), "                                                                                           
## [230] ""                                                                                                                                    
## [231] "\"Genetic Algorithms,\" 2005,"                                                                                                       
## [232] ""                                                                                                                                    
## [233] "\"Recovery Risk,\" 2005,"                                                                                                            
## [234] ""                                                                                                                                    
## [235] "\"Venture Capital Syndication\", (with Hoje Jo and Yongtae Kim), 2004"                                                               
## [236] ""                                                                                                                                    
## [237] "\"Technical Analysis\", (with David Tien), 2004"                                                                                     
## [238] ""                                                                                                                                    
## [239] "\"Liquidity and the Bond Markets, (with Jan Ericsson and "                                                                           
## [240] "Madhu Kalimipalli), 2003,"                                                                                                           
## [241] ""                                                                                                                                    
## [242] "\"Modern Pricing of Interest Rate Derivatives - Book Review\", "                                                                     
## [243] ""                                                                                                                                    
## [244] "\"Contagion\", 2003,"                                                                                                                
## [245] ""                                                                                                                                    
## [246] "\"Hedge Funds\", 2003,"                                                                                                              
## [247] "Reprinted in "                                                                                                                       
## [248] "\"Working Papers on Hedge Funds,\" in The World of Hedge Funds: "                                                                    
## [249] "Characteristics and "                                                                                                                
## [250] "Analysis, 2005, World Scientific."                                                                                                   
## [251] ""                                                                                                                                    
## [252] "\"The Internet and Investors\", 2003,"                                                                                               
## [253] ""                                                                                                                                    
## [254] "  \"Useful things to know about Correlated Default Risk,\""                                                                          
## [255] "(with Gifford Fong, Laurence Freed, Gary Geng, and Nikunj Kapadia),"                                                                 
## [256] ""                                                                                                                                    
## [257] "\"The Regulation of Fee Structures in Mutual Funds: A Theoretical Analysis,'' "                                                      
## [258] "(with Rangarajan Sundaram), 1998, NBER WP No 6639, in the"                                                                           
## [259] "Courant Institute of Mathematical Sciences, special volume on"                                                                       
## [260] ""                                                                                                                                    
## [261] "\"A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "                                                      
## [262] "(with Rangarajan Sundaram), reprinted in "                                                                                           
## [263] "the Courant Institute of Mathematical Sciences, special volume on"                                                                   
## [264] ""                                                                                                                                    
## [265] "\"Stochastic Mean Models of the Term Structure,''"                                                                                   
## [266] "(with Pierluigi Balduzzi, Silverio Foresi and Rangarajan Sundaram), "                                                                
## [267] "John Wiley & Sons, Inc., 128-161."                                                                                                   
## [268] ""                                                                                                                                    
## [269] "\"Interest Rate Modeling with Jump-Diffusion Processes,'' "                                                                          
## [270] "John Wiley & Sons, Inc., 162-189."                                                                                                   
## [271] ""                                                                                                                                    
## [272] "Comments on 'Pricing Excess-of-Loss Reinsurance Contracts against"                                                                   
## [273] "Catastrophic Loss,' by J. David Cummins, C. Lewis, and Richard Phillips,"                                                            
## [274] "Froot (Ed.), University of Chicago Press, 1999, 141-145."                                                                            
## [275] ""                                                                                                                                    
## [276] "  \"Pricing Credit Derivatives,'' "                                                                                                  
## [277] "J. Frost and J.G. Whittaker, 101-138."                                                                                               
## [278] ""                                                                                                                                    
## [279] "\"On the Recursive Implementation of Term Structure Models,'' "                                                                      
## [280] ""                                                                                                                                    
## [281] ""                                                                                                                                    
## [282] ""                                                                                                                                    
## [283] ""                                                                                                                                    
## [284] ""                                                                                                                                    
## [285] ""                                                                                                                                    
## [286] "\"Efficient Rebalancing of Taxable Portfolios\" (with Dan Ostrov, Dennis Ding, Vincent Newell), "                                    
## [287] ""                                                                                                                                    
## [288] "\"Rollover Risk and Capital Structure Covenants in Structured Finance Vehicles\","                                                   
## [289] " (with Seoyoung Kim), "                                                                                                              
## [290] ""                                                                                                                                    
## [291] ""                                                                                                                                    
## [292] "\"Liability Directed Investing in a Behavioral Portfolio Theory Framework"                                                           
## [293] " (with Seoyoung Kim and Meir Statman), "                                                                                             
## [294] ""                                                                                                                                    
## [295] ""                                                                                                                                    
## [296] "\"The Fast and the Curious: VC Drift\" "                                                                                             
## [297] "(with Amit Bubna and Paul Hanouna), "                                                                                                
## [298] ""                                                                                                                                    
## [299] ""                                                                                                                                    
## [300] "\"Venture Capital Communities\" (with Amit Bubna and Nagpurnanand Prabhala), "                                                       
## [301] ""                                                                                                                                    
## [302] ""                                                                                                                                    
## [303] ""                                                                                                                                    
## [304] ""                                                                                                                                    
## [305] ""                                                                                                                                    
## [306] ""                                                                                                                                    
## [307] ""                                                                                                                                    
## [308] ""                                                                                                                                    
## [309] ""                                                                                                                                    
## [310] ""                                                                                                                                    
## [311] ""                                                                                                                                    
## [312] ""                                                                                                                                    
## [313] ""                                                                                                                                    
## [314] ""                                                                                                                                    
## [315] ""                                                                                                                                    
## [316] ""                                                                                                                                    
## [317] "                                                "                                                                                    
## [318] ""                                                                                                                                    
## [319] ""                                                                                                                                    
## [320] ""                                                                                                                                    
## [321] ""                                                                                                                                    
## [322] ""                                                                                                                                    
## [323] ""                                                                                                                                    
## [324] ""
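The cleanup step that follows removes double quotes and then drops zero-length strings. Note that element [317] above consists only of spaces, so a test on `nchar(text)==0` keeps it. A minimal sketch on a toy vector (the `toy` data and the names `kept_exact` and `kept_trim` are hypothetical, and base R `gsub` and `trimws` stand in for the stringr calls used in the talk) shows the difference between dropping empty strings and dropping blank ones:

```r
# Toy stand-in for the scraped lines (hypothetical data, not the real page)
toy = c("\"Recovery Risk,\" 2005,", "", "   ", "(with Jacob Sisk), ")

# Strip straight double quotes, as the talk does with str_replace_all
toy = gsub("\"", "", toy, fixed = TRUE)

# Keep strings with at least one character (whitespace-only strings survive)
kept_exact = toy[nchar(toy) > 0]

# Keep strings that are non-empty after trimming (whitespace-only strings go too)
kept_trim = toy[nchar(trimws(toy)) > 0]

print(length(kept_exact))
print(length(kept_trim))
```

Whether whitespace-only lines should be dropped depends on what the later steps expect; the trimmed variant is only one possible choice.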
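# Remove straight double quotes (LaTeX-style `` and '' quotes are left as-is)
text = str_replace_all(text,"[\"]","")
# Indices of zero-length strings; whitespace-only strings are not matched
idx = which(nchar(text)==0)
# Keep every element whose index is not in idx
research = text[setdiff(seq_along(text),idx)]
print(research)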
##   [1] "Data Science: Theories, Models, Algorithms, and Analytics (web book -- work in progress)"                                        
##   [2] "Derivatives: Principles and Practice (2010),"                                                                                    
##   [3] "(Rangarajan Sundaram and Sanjiv Das), McGraw Hill."                                                                              
##   [4] "An Index-Based Measure of Liquidity,'' (with George Chacko and Rong Fan), (2016)."                                               
##   [5] "Matrix Metrics: Network-Based Systemic Risk Scoring, (2016)."                                                                    
##   [6] "of systemic risk. This paper won the First Prize in the MIT-CFP competition 2016 for "                                           
##   [7] "the best paper on SIFIs (systemically important financial institutions). "                                                       
##   [8] "It also won the best paper award at "                                                                                            
##   [9] "Credit Spreads with Dynamic Debt (with Seoyoung Kim), (2015), "                                                                  
##  [10] "Text and Context: Language Analytics for Finance, (2014),"                                                                       
##  [11] "Strategic Loan Modification: An Options-Based Response to Strategic Default,"                                                    
##  [12] "Options and Structured Products in Behavioral Portfolios, (with Meir Statman), (2013), "                                         
##  [13] "and barrier range notes, in the presence of fat-tailed outcomes using copulas."                                                  
##  [14] "Polishing Diamonds in the Rough: The Sources of Syndicated Venture Performance, (2011), (with Hoje Jo and Yongtae Kim), "        
##  [15] "Optimization with Mental Accounts, (2010), (with Harry Markowitz, Jonathan"                                                      
##  [16] "Accounting-based versus market-based cross-sectional models of CDS spreads, "                                                    
##  [17] "(with Paul Hanouna and Atulya Sarin), (2009), "                                                                                  
##  [18] "Hedging Credit: Equity Liquidity Matters, (with Paul Hanouna), (2009),"                                                          
##  [19] "An Integrated Model for Hybrid Securities,"                                                                                      
##  [20] "Yahoo for Amazon! Sentiment Extraction from Small Talk on the Web,"                                                              
##  [21] "Common Failings: How Corporate Defaults are Correlated "                                                                         
##  [22] "(with Darrell Duffie, Nikunj Kapadia and Leandro Saita)."                                                                        
##  [23] "A Clinical Study of Investor Discussion and Sentiment, "                                                                         
##  [24] "(with Asis Martinez-Jerez and Peter Tufano), 2005, "                                                                             
##  [25] "International Portfolio Choice with Systemic Risk,"                                                                              
##  [26] "The loss resulting from diminished diversification is small, while"                                                              
##  [27] "Speech: Signaling, Risk-sharing and the Impact of Fee Structures on"                                                             
##  [28] "investor welfare. Contrary to regulatory intuition, incentive structures"                                                        
##  [29] "A Discrete-Time Approach to No-arbitrage Pricing of Credit derivatives"                                                          
##  [30] "with Rating Transitions, (with Viral Acharya and Rangarajan Sundaram),"                                                          
##  [31] "Pricing Interest Rate Derivatives: A General Approach,''(with George Chacko),"                                                   
##  [32] "A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "                                                    
##  [33] "The Psychology of Financial Decision Making: A Case"                                                                             
##  [34] "for Theory-Driven Experimental Enquiry,''"                                                                                       
##  [35] "1999, (with Priya Raghubir),"                                                                                                    
##  [36] "Of Smiles and Smirks: A Term Structure Perspective,''"                                                                           
##  [37] "A Theory of Banking Structure, 1999, (with Ashish Nanda),"                                                                       
##  [38] "by function based upon two dimensions: the degree of information asymmetry "                                                     
##  [39] "A Theory of Optimal Timing and Selectivity,'' "                                                                                  
##  [40] "A Direct Discrete-Time Approach to"                                                                                              
##  [41] "Poisson-Gaussian Bond Option Pricing in the Heath-Jarrow-Morton "                                                                
##  [42] "The Central Tendency: A Second Factor in"                                                                                        
##  [43] "Bond Yields, 1998, (with Silverio Foresi and Pierluigi Balduzzi),  "                                                             
##  [44] "Efficiency with Costly Information: A Reinterpretation of"                                                                       
##  [45] "Evidence from Managed Portfolios, (with Edwin Elton, Martin Gruber and Matt "                                                    
##  [46] "Presented and Reprinted in the Proceedings of The "                                                                              
##  [47] "Seminar on the Analysis of Security Prices at the Center "                                                                       
##  [48] "for Research in Security   Prices  at the University of "                                                                        
##  [49] "Coming up Short: Managing Underfunded Portfolios in an LDI-ES Framework (2014), "                                                
##  [50] "(with Seoyoung Kim and Meir Statman),  "                                                                                         
##  [51] "Going for Broke: Restructuring Distressed Debt Portfolios (2014),"                                                               
##  [52] "Digital Portfolios. (2013), "                                                                                                    
##  [53] "Options on Portfolios with Higher-Order Moments, (2009),"                                                                        
##  [54] "options on a multivariate system of assets, calibrated to the return "                                                           
##  [55] "Dealing with Dimension: Option Pricing on Factor Trees, (2009),"                                                                 
##  [56] "you to price options on multiple assets in a unified fraamework. Computational"                                                  
##  [57] "Modeling"                                                                                                                        
##  [58] "Correlated Default with a Forest of Binomial Trees, (2007), (with"                                                               
##  [59] "Basel II: Correlation Related Issues (2007), "                                                                                   
##  [60] "Correlated Default Risk, (2006),"                                                                                                
##  [61] "(with Laurence Freed, Gary Geng, and Nikunj Kapadia),"                                                                           
##  [62] "increase as markets worsen. Regime switching models are needed to explain dynamic"                                               
##  [63] "A Simple Model for Pricing Equity Options with Markov"                                                                           
##  [64] "Switching State Variables (2006),"                                                                                               
##  [65] "(with Donald Aingworth and Rajeev Motwani),"                                                                                     
##  [66] "The Firm's Management of Social Interactions, (2005)"                                                                            
##  [67] "(with D. Godes, D. Mayzlin, Y. Chen, S. Das, C. Dellarocas, "                                                                    
##  [68] "B. Pfeieffer, B. Libai, S. Sen, M. Shi, and P. Verlegh). "                                                                       
##  [69] "Financial Communities (with Jacob Sisk), 2005, "                                                                                 
##  [70] "Summer, 112-123."                                                                                                                
##  [71] "Monte Carlo Markov Chain Methods for Derivative Pricing"                                                                         
##  [72] "and Risk Assessment,(with Alistair Sinclair), 2005, "                                                                            
##  [73] "where incomplete information about the value of an asset may be exploited to "                                                   
##  [74] "undertake fast and accurate pricing. Proof that a fully polynomial randomized "                                                  
##  [75] "Correlated Default Processes: A Criterion-Based Copula Approach,"                                                                
##  [76] "Special Issue on Default Risk. "                                                                                                 
##  [77] "Private Equity Returns: An Empirical Examination of the Exit of"                                                                 
##  [78] "Venture-Backed Companies, (with Murali Jagannathan and Atulya Sarin),"                                                           
##  [79] "firm being financed, the valuation at the time of financing, and the prevailing market"                                          
##  [80] "sentiment. Helps understand the risk premium required for the"                                                                   
##  [81] "Issue on Computational Methods in Economics and Finance),  "                                                                     
##  [82] "December, 55-69."                                                                                                                
##  [83] "Bayesian Migration in Credit Ratings Based on Probabilities of"                                                                  
##  [84] "The Impact of Correlated Default Risk on Credit Portfolios,"                                                                     
##  [85] "(with Gifford Fong, and Gary Geng),"                                                                                             
##  [86] "How Diversified are Internationally Diversified Portfolios:"                                                                     
##  [87] "Time-Variation in the Covariances between International Returns,"                                                                
##  [88] "Discrete-Time Bond and Option Pricing for Jump-Diffusion"                                                                        
##  [89] "Macroeconomic Implications of Search Theory for the Labor Market,"                                                               
##  [90] "Auction Theory: A Summary with Applications and Evidence"                                                                        
##  [91] "from the Treasury Markets, 1996, (with Rangarajan Sundaram),"                                                                    
##  [92] "A Simple Approach to Three Factor Affine Models of the"                                                                          
##  [93] "Term Structure, (with Pierluigi Balduzzi, Silverio Foresi and Rangarajan"                                                        
##  [94] "Analytical Approximations of  the Term Structure"                                                                                
##  [95] "for Jump-diffusion Processes: A Numerical Analysis, 1996, "                                                                      
##  [96] "Markov Chain Term Structure Models: Extensions and Applications,"                                                                
##  [97] "Exact Solutions for Bond and Options Prices"                                                                                     
##  [98] "with Systematic Jump Risk, 1996, (with Silverio Foresi),"                                                                        
##  [99] "Pricing Credit Sensitive Debt when Interest Rates, Credit Ratings"                                                               
## [100] "and Credit Spreads are Stochastic, 1996, "                                                                                       
## [101] "v5(2), 161-198."                                                                                                                 
## [102] "Portfolios for Investors Who Want to Reach Their Goals While Staying on the Mean-Variance Efficient Frontier, (2011), "          
## [103] "(with Harry Markowitz, Jonathan Scheid, and Meir Statman), "                                                                     
## [104] "News Analytics: Framework, Techniques and Metrics, The Handbook of News Analytics in Finance, May 2011, John Wiley & Sons, U.K. "
## [105] "Random Lattices for Option Pricing Problems in Finance, (2011),"                                                                 
## [106] "Implementing Option Pricing Models using Python and Cython, (2010),"                                                             
## [107] "The Finance Web: Internet Information and Markets, (2010), "                                                                     
## [108] "Financial Applications with Parallel R, (2009), "                                                                                
## [109] "Recovery Swaps, (2009), (with Paul Hanouna),  "                                                                                  
## [110] "Recovery Rates, (2009),(with Paul Hanouna), "                                                                                    
## [111] "``A Simple Model for Pricing Securities with a Debt-Equity Linkage,'' 2008, in "                                                 
## [112] "Credit Default Swap Spreads, 2006, (with Paul Hanouna), "                                                                        
## [113] "Multiple-Core Processors for Finance Applications, 2006, "                                                                       
## [114] "Power Laws, 2005, (with Jacob Sisk), "                                                                                           
## [115] "Genetic Algorithms, 2005,"                                                                                                       
## [116] "Recovery Risk, 2005,"                                                                                                            
## [117] "Venture Capital Syndication, (with Hoje Jo and Yongtae Kim), 2004"                                                               
## [118] "Technical Analysis, (with David Tien), 2004"                                                                                     
## [119] "Liquidity and the Bond Markets, (with Jan Ericsson and "                                                                         
## [120] "Madhu Kalimipalli), 2003,"                                                                                                       
## [121] "Modern Pricing of Interest Rate Derivatives - Book Review, "                                                                     
## [122] "Contagion, 2003,"                                                                                                                
## [123] "Hedge Funds, 2003,"                                                                                                              
## [124] "Reprinted in "                                                                                                                   
## [125] "Working Papers on Hedge Funds, in The World of Hedge Funds: "                                                                    
## [126] "Characteristics and "                                                                                                            
## [127] "Analysis, 2005, World Scientific."                                                                                               
## [128] "The Internet and Investors, 2003,"                                                                                               
## [129] "  Useful things to know about Correlated Default Risk,"                                                                          
## [130] "(with Gifford Fong, Laurence Freed, Gary Geng, and Nikunj Kapadia),"                                                             
## [131] "The Regulation of Fee Structures in Mutual Funds: A Theoretical Analysis,'' "                                                    
## [132] "(with Rangarajan Sundaram), 1998, NBER WP No 6639, in the"                                                                       
## [133] "Courant Institute of Mathematical Sciences, special volume on"                                                                   
## [134] "A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "                                                    
## [135] "(with Rangarajan Sundaram), reprinted in "                                                                                       
## [136] "the Courant Institute of Mathematical Sciences, special volume on"                                                               
## [137] "Stochastic Mean Models of the Term Structure,''"                                                                                 
## [138] "(with Pierluigi Balduzzi, Silverio Foresi and Rangarajan Sundaram), "                                                            
## [139] "John Wiley & Sons, Inc., 128-161."                                                                                               
## [140] "Interest Rate Modeling with Jump-Diffusion Processes,'' "                                                                        
## [141] "John Wiley & Sons, Inc., 162-189."                                                                                               
## [142] "Comments on 'Pricing Excess-of-Loss Reinsurance Contracts against"                                                               
## [143] "Catastrophic Loss,' by J. David Cummins, C. Lewis, and Richard Phillips,"                                                        
## [144] "Froot (Ed.), University of Chicago Press, 1999, 141-145."                                                                        
## [145] "  Pricing Credit Derivatives,'' "                                                                                                
## [146] "J. Frost and J.G. Whittaker, 101-138."                                                                                           
## [147] "On the Recursive Implementation of Term Structure Models,'' "                                                                    
## [148] "Efficient Rebalancing of Taxable Portfolios (with Dan Ostrov, Dennis Ding, Vincent Newell), "                                    
## [149] "Rollover Risk and Capital Structure Covenants in Structured Finance Vehicles,"                                                   
## [150] " (with Seoyoung Kim), "                                                                                                          
## [151] "Liability Directed Investing in a Behavioral Portfolio Theory Framework"                                                         
## [152] " (with Seoyoung Kim and Meir Statman), "                                                                                         
## [153] "The Fast and the Curious: VC Drift "                                                                                             
## [154] "(with Amit Bubna and Paul Hanouna), "                                                                                            
## [155] "Venture Capital Communities (with Amit Bubna and Nagpurnanand Prabhala), "                                                       
## [156] "                                                "

Take a look at the text now to see how much cleaner it is. But there is a better way: use the text-mining package tm.

Text Mining with the “tm” Package

  1. The R programming language supports a text-mining package, succinctly named tm. Using functions such as readDOC(), readPDF(), etc., for reading DOC and PDF files, the package makes accessing various file formats easy.

  2. Text mining involves applying functions to many text documents. A library of text documents (irrespective of format) is called a corpus. The essential and highly useful feature of text mining packages is the ability to operate on the entire set of documents at one go.

library(tm)
## Loading required package: NLP
text = c("INTL is expected to announce good earnings report", "AAPL first quarter disappoints","GOOG announces new wallet", "YHOO ascends from old ways")
text_corpus = Corpus(VectorSource(text))
print(text_corpus)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 4
writeCorpus(text_corpus)

The writeCorpus() function in tm creates a separate text file on the hard drive for each document, by default named 1.txt, 2.txt, etc. The simple program code above shows how text scraped off a web page, and collapsed into a single character string for each document, may then be converted into a corpus of documents using the Corpus() function.
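
Since writeCorpus() writes into the working directory unless told otherwise, it can be convenient to point it at a separate folder. The following is a minimal sketch, assuming the path argument of writeCorpus() and a temporary folder; the three documents and the name out_dir are made up for illustration, and VCorpus() is called explicitly so the corpus type does not depend on the installed tm version.

```r
library(tm)

# Hypothetical example documents (not from the talk)
docs = c("first document", "second document", "third document")
corp = VCorpus(VectorSource(docs))

# Write into a fresh temporary folder instead of the working directory
out_dir = tempfile("corpus_out")
dir.create(out_dir)
writeCorpus(corp, path = out_dir)

# One text file per document should now exist in out_dir
files = list.files(out_dir)
print(files)
```

Each file can then be read back with readLines(file.path(out_dir, files[1])) to confirm its content.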

It is easy to inspect the corpus as follows:

inspect(text_corpus)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 4
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 49
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 30
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 25
## 
## [[4]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 26

A second example

Here we use lapply to inspect the contents of the corpus.

#USING THE tm PACKAGE
library(tm)
text = c("Doc1;","This is doc2 --", "And, then Doc3.")
ctext = Corpus(VectorSource(text))
ctext
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
#writeCorpus(ctext)

#THE CORPUS IS A LIST OBJECT in R of type VCorpus or Corpus
inspect(ctext)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 5
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 15
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 15
print(as.character(ctext[[1]]))
## [1] "Doc1;"
print(lapply(ctext[1:2],as.character))
## $`1`
## [1] "Doc1;"
## 
## $`2`
## [1] "This is doc2 --"
ctext = tm_map(ctext,tolower)  #Lower case all text in all docs
inspect(ctext)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## [1] doc1;
## 
## [[2]]
## [1] this is doc2 --
## 
## [[3]]
## [1] and, then doc3.
ctext2 = tm_map(ctext,toupper)
inspect(ctext2)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## [1] DOC1;
## 
## [[2]]
## [1] THIS IS DOC2 --
## 
## [[3]]
## [1] AND, THEN DOC3.
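
Notice that after tm_map(ctext,tolower) the documents print as bare character strings rather than as PlainTextDocument objects. In more recent tm releases this can cause trouble for later steps that expect text documents, and the usual remedy is to wrap base R functions in content_transformer(). A minimal sketch, assuming a tm version that provides content_transformer(); the name ctext_lower is introduced here only for illustration.

```r
library(tm)

text = c("Doc1;", "This is doc2 --", "And, then Doc3.")
ctext = VCorpus(VectorSource(text))

# Wrap the base R function so each element remains a text document
ctext_lower = tm_map(ctext, content_transformer(tolower))

print(class(ctext_lower[[1]]))
print(as.character(ctext_lower[[2]]))
```

Functions that ship with tm, such as removeWords and removePunctuation, are already document-aware and do not need the wrapper.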

Function tm_map

#FIRST CURATE TO UPPER CASE
dropWords = c("IS","AND","THEN")
ctext2 = tm_map(ctext2,removeWords,dropWords)
inspect(ctext2)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## [1] DOC1;
## 
## [[2]]
## [1] THIS  DOC2 --
## 
## [[3]]
## [1] ,  DOC3.
ctext = Corpus(VectorSource(text))
temp = ctext
print(lapply(temp,as.character))
## $`1`
## [1] "Doc1;"
## 
## $`2`
## [1] "This is doc2 --"
## 
## $`3`
## [1] "And, then Doc3."
temp = tm_map(temp,removeWords,stopwords("english"))
print(lapply(temp,as.character))
## $`1`
## [1] "Doc1;"
## 
## $`2`
## [1] "This  doc2 --"
## 
## $`3`
## [1] "And,  Doc3."
temp = tm_map(temp,removePunctuation)
print(lapply(temp,as.character))
## $`1`
## [1] "Doc1"
## 
## $`2`
## [1] "This  doc2 "
## 
## $`3`
## [1] "And  Doc3"
temp = tm_map(temp,removeNumbers)
print(lapply(temp,as.character))
## $`1`
## [1] "Doc"
## 
## $`2`
## [1] "This  doc "
## 
## $`3`
## [1] "And  Doc"
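
The removals above leave double spaces and trailing blanks where words and symbols used to be. A possible extra step, sketched here under the assumption that tm's stripWhitespace() and base R's trimws() are acceptable for the job, collapses the runs of spaces and trims the ends. The name cleaned is introduced only for this illustration.

```r
library(tm)

text = c("Doc1;", "This is doc2 --", "And, then Doc3.")
temp = VCorpus(VectorSource(text))

# Same cleaning steps as above
temp = tm_map(temp, removeWords, stopwords("english"))
temp = tm_map(temp, removePunctuation)
temp = tm_map(temp, removeNumbers)

# Collapse runs of whitespace into a single space
temp = tm_map(temp, stripWhitespace)

# Pull out the strings and trim leading/trailing blanks
cleaned = trimws(sapply(temp, as.character))
print(cleaned)
```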

Bag of Words

We can create a bag of words by collapsing all the text into one bundle.

#CONVERT CORPUS INTO ARRAY OF STRINGS AND FLATTEN
txt = NULL
for (j in 1:length(temp)) {
  txt = c(txt,temp[[j]]$content)
}
txt = paste(txt,collapse=" ")
txt = tolower(txt)
print(txt)
## [1] "doc this  doc  and  doc"
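
Once the text is in one bundle, the words can be counted. The following is a minimal sketch in base R, starting from the string printed above; the names words and freq are introduced only for illustration.

```r
# The flattened bag of words from the previous step
txt = "doc this  doc  and  doc"

# Split on runs of whitespace and drop any empty tokens
words = strsplit(txt, "\\s+")[[1]]
words = words[nzchar(words)]

# Tabulate and sort by frequency, most common first
freq = sort(table(words), decreasing = TRUE)
print(freq)
```

Splitting on "\\s+" rather than on a single space avoids empty tokens from the double spaces left by the earlier removals.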

Example (on my bio page)

Now we will do a full pass through these steps on my bio page.

text = readLines("http://srdas.github.io/bio-candid.html")
ctext = Corpus(VectorSource(text))
ctext
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 79
print(lapply(ctext, as.character))
## $`1`
## [1] "<HTML>"
## 
## $`2`
## [1] "<BODY background=\"http://algo.scu.edu/~sanjivdas/graphics/back2.gif\">"
## 
## $`3`
## [1] ""
## 
## $`4`
## [1] "Sanjiv Das is the William and Janice Terry Professor of Finance at"
## 
## $`5`
## [1] "Santa Clara University's Leavey School of Business. He previously held"
## 
## $`6`
## [1] "faculty appointments as Associate Professor at Harvard Business School"
## 
## $`7`
## [1] "and UC Berkeley. He holds post-graduate degrees in Finance (M.Phil and"
## 
## $`8`
## [1] "Ph.D. from New York University), Computer Science (M.S. from UC"
## 
## $`9`
## [1] "Berkeley), an MBA from the Indian Institute of Management, Ahmedabad,"
## 
## $`10`
## [1] "B.Com in Accounting and Economics (University of Bombay, Sydenham"
## 
## $`11`
## [1] "College), and is also a qualified Cost and Works Accountant. He is a"
## 
## $`12`
## [1] "senior editor of The Journal of Investment Management, co-editor of"
## 
## $`13`
## [1] "The Journal of Derivatives and The Journal of Financial Services"
## 
## $`14`
## [1] "Research, and Associate Editor of other academic journals. Prior to"
## 
## $`15`
## [1] "being an academic, he worked in the derivatives business in the"
## 
## $`16`
## [1] "Asia-Pacific region as a Vice-President at Citibank. His current"
## 
## $`17`
## [1] "research interests include: the modeling of default risk, machine"
## 
## $`18`
## [1] "learning, social networks, derivatives pricing models, portfolio"
## 
## $`19`
## [1] "theory, and venture capital. He has published over eighty articles in"
## 
## $`20`
## [1] "academic journals, and has won numerous awards for research and"
## 
## $`21`
## [1] "teaching. His recent book \"Derivatives: Principles and Practice\" was"
## 
## $`22`
## [1] "published in May 2010.  He currently also serves as a Senior Fellow at"
## 
## $`23`
## [1] "the FDIC Center for Financial Research."
## 
## $`24`
## [1] ""
## 
## $`25`
## [1] ""
## 
## $`26`
## [1] "<p> <B>Sanjiv Das: A Short Academic Life History</B> <p>"
## 
## $`27`
## [1] ""
## 
## $`28`
## [1] "After loafing and working in many parts of Asia, but never really"
## 
## $`29`
## [1] "growing up, Sanjiv moved to New York to change the world, hopefully"
## 
## $`30`
## [1] "through research.  He graduated in 1994 with a Ph.D. from NYU, and"
## 
## $`31`
## [1] "since then spent five years in Boston, and now lives in San Jose,"
## 
## $`32`
## [1] "California.  Sanjiv loves animals, places in the world where the"
## 
## $`33`
## [1] "mountains meet the sea, riding sport motorbikes, reading, gadgets,"
## 
## $`34`
## [1] "science fiction movies, and writing cool software code. When there is"
## 
## $`35`
## [1] "time available from the excitement of daily life, Sanjiv writes"
## 
## $`36`
## [1] "academic papers, which helps him relax. Always the contrarian, Sanjiv"
## 
## $`37`
## [1] "thinks that New York City is the most calming place in the world,"
## 
## $`38`
## [1] "after California of course."
## 
## $`39`
## [1] ""
## 
## $`40`
## [1] "<p>"
## 
## $`41`
## [1] ""
## 
## $`42`
## [1] "Sanjiv is now a Professor of Finance at Santa Clara University. He came"
## 
## $`43`
## [1] "to SCU from Harvard Business School and spent a year at UC Berkeley. In"
## 
## $`44`
## [1] "his past life in the unreal world, Sanjiv worked at Citibank, N.A. in"
## 
## $`45`
## [1] "the Asia-Pacific region. He takes great pleasure in merging his many"
## 
## $`46`
## [1] "previous lives into his current existence, which is incredibly confused"
## 
## $`47`
## [1] "and diverse."
## 
## $`48`
## [1] ""
## 
## $`49`
## [1] "<p>"
## 
## $`50`
## [1] ""
## 
## $`51`
## [1] "Sanjiv's research style is instilled with a distinct \"New York state of"
## 
## $`52`
## [1] "mind\" - it is chaotic, diverse, with minimal method to the madness. He"
## 
## $`53`
## [1] "has published articles on derivatives, term-structure models, mutual"
## 
## $`54`
## [1] "funds, the internet, portfolio choice, banking models, credit risk, and"
## 
## $`55`
## [1] "has unpublished articles in many other areas. Some years ago, he took"
## 
## $`56`
## [1] "time off to get another degree in computer science at Berkeley,"
## 
## $`57`
## [1] "confirming that an unchecked hobby can quickly become an obsession."
## 
## $`58`
## [1] "There he learnt about the fascinating field of Randomized Algorithms,"
## 
## $`59`
## [1] "skills he now applies earnestly to his editorial work, and other"
## 
## $`60`
## [1] "pursuits, many of which stem from being in the epicenter of Silicon"
## 
## $`61`
## [1] "Valley."
## 
## $`62`
## [1] ""
## 
## $`63`
## [1] "<p>"
## 
## $`64`
## [1] ""
## 
## $`65`
## [1] "Coastal living did a lot to mold Sanjiv, who needs to live near the"
## 
## $`66`
## [1] "ocean.  The many walks in Greenwich village convinced him that there is"
## 
## $`67`
## [1] "no such thing as a representative investor, yet added many unique"
## 
## $`68`
## [1] "features to his personal utility function. He learnt that it is"
## 
## $`69`
## [1] "important to open the academic door to the ivory tower and let the world"
## 
## $`70`
## [1] "in. Academia is a real challenge, given that he has to reconcile many"
## 
## $`71`
## [1] "more opinions than ideas. He has been known to have turned down many"
## 
## $`72`
## [1] "offers from Mad magazine to publish his academic work. As he often"
## 
## $`73`
## [1] "explains, you never really finish your education - \"you can check out"
## 
## $`74`
## [1] "any time you like, but you can never leave.\" Which is why he is doomed"
## 
## $`75`
## [1] "to a lifetime in Hotel California. And he believes that, if this is as"
## 
## $`76`
## [1] "bad as it gets, life is really pretty good."
## 
## $`77`
## [1] ""
## 
## $`78`
## [1] ""
## 
## $`79`
## [1] ""
ctext = tm_map(ctext,removePunctuation)
print(lapply(ctext, as.character))
## $`1`
## [1] "HTML"
## 
## $`2`
## [1] "BODY backgroundhttpalgoscuedusanjivdasgraphicsback2gif"
## 
## $`3`
## [1] ""
## 
## $`4`
## [1] "Sanjiv Das is the William and Janice Terry Professor of Finance at"
## 
## $`5`
## [1] "Santa Clara Universitys Leavey School of Business He previously held"
## 
## $`6`
## [1] "faculty appointments as Associate Professor at Harvard Business School"
## 
## $`7`
## [1] "and UC Berkeley He holds postgraduate degrees in Finance MPhil and"
## 
## $`8`
## [1] "PhD from New York University Computer Science MS from UC"
## 
## $`9`
## [1] "Berkeley an MBA from the Indian Institute of Management Ahmedabad"
## 
## $`10`
## [1] "BCom in Accounting and Economics University of Bombay Sydenham"
## 
## $`11`
## [1] "College and is also a qualified Cost and Works Accountant He is a"
## 
## $`12`
## [1] "senior editor of The Journal of Investment Management coeditor of"
## 
## $`13`
## [1] "The Journal of Derivatives and The Journal of Financial Services"
## 
## $`14`
## [1] "Research and Associate Editor of other academic journals Prior to"
## 
## $`15`
## [1] "being an academic he worked in the derivatives business in the"
## 
## $`16`
## [1] "AsiaPacific region as a VicePresident at Citibank His current"
## 
## $`17`
## [1] "research interests include the modeling of default risk machine"
## 
## $`18`
## [1] "learning social networks derivatives pricing models portfolio"
## 
## $`19`
## [1] "theory and venture capital He has published over eighty articles in"
## 
## $`20`
## [1] "academic journals and has won numerous awards for research and"
## 
## $`21`
## [1] "teaching His recent book Derivatives Principles and Practice was"
## 
## $`22`
## [1] "published in May 2010  He currently also serves as a Senior Fellow at"
## 
## $`23`
## [1] "the FDIC Center for Financial Research"
## 
## $`24`
## [1] ""
## 
## $`25`
## [1] ""
## 
## $`26`
## [1] "p BSanjiv Das A Short Academic Life HistoryB p"
## 
## $`27`
## [1] ""
## 
## $`28`
## [1] "After loafing and working in many parts of Asia but never really"
## 
## $`29`
## [1] "growing up Sanjiv moved to New York to change the world hopefully"
## 
## $`30`
## [1] "through research  He graduated in 1994 with a PhD from NYU and"
## 
## $`31`
## [1] "since then spent five years in Boston and now lives in San Jose"
## 
## $`32`
## [1] "California  Sanjiv loves animals places in the world where the"
## 
## $`33`
## [1] "mountains meet the sea riding sport motorbikes reading gadgets"
## 
## $`34`
## [1] "science fiction movies and writing cool software code When there is"
## 
## $`35`
## [1] "time available from the excitement of daily life Sanjiv writes"
## 
## $`36`
## [1] "academic papers which helps him relax Always the contrarian Sanjiv"
## 
## $`37`
## [1] "thinks that New York City is the most calming place in the world"
## 
## $`38`
## [1] "after California of course"
## 
## $`39`
## [1] ""
## 
## $`40`
## [1] "p"
## 
## $`41`
## [1] ""
## 
## $`42`
## [1] "Sanjiv is now a Professor of Finance at Santa Clara University He came"
## 
## $`43`
## [1] "to SCU from Harvard Business School and spent a year at UC Berkeley In"
## 
## $`44`
## [1] "his past life in the unreal world Sanjiv worked at Citibank NA in"
## 
## $`45`
## [1] "the AsiaPacific region He takes great pleasure in merging his many"
## 
## $`46`
## [1] "previous lives into his current existence which is incredibly confused"
## 
## $`47`
## [1] "and diverse"
## 
## $`48`
## [1] ""
## 
## $`49`
## [1] "p"
## 
## $`50`
## [1] ""
## 
## $`51`
## [1] "Sanjivs research style is instilled with a distinct New York state of"
## 
## $`52`
## [1] "mind  it is chaotic diverse with minimal method to the madness He"
## 
## $`53`
## [1] "has published articles on derivatives termstructure models mutual"
## 
## $`54`
## [1] "funds the internet portfolio choice banking models credit risk and"
## 
## $`55`
## [1] "has unpublished articles in many other areas Some years ago he took"
## 
## $`56`
## [1] "time off to get another degree in computer science at Berkeley"
## 
## $`57`
## [1] "confirming that an unchecked hobby can quickly become an obsession"
## 
## $`58`
## [1] "There he learnt about the fascinating field of Randomized Algorithms"
## 
## $`59`
## [1] "skills he now applies earnestly to his editorial work and other"
## 
## $`60`
## [1] "pursuits many of which stem from being in the epicenter of Silicon"
## 
## $`61`
## [1] "Valley"
## 
## $`62`
## [1] ""
## 
## $`63`
## [1] "p"
## 
## $`64`
## [1] ""
## 
## $`65`
## [1] "Coastal living did a lot to mold Sanjiv who needs to live near the"
## 
## $`66`
## [1] "ocean  The many walks in Greenwich village convinced him that there is"
## 
## $`67`
## [1] "no such thing as a representative investor yet added many unique"
## 
## $`68`
## [1] "features to his personal utility function He learnt that it is"
## 
## $`69`
## [1] "important to open the academic door to the ivory tower and let the world"
## 
## $`70`
## [1] "in Academia is a real challenge given that he has to reconcile many"
## 
## $`71`
## [1] "more opinions than ideas He has been known to have turned down many"
## 
## $`72`
## [1] "offers from Mad magazine to publish his academic work As he often"
## 
## $`73`
## [1] "explains you never really finish your education  you can check out"
## 
## $`74`
## [1] "any time you like but you can never leave Which is why he is doomed"
## 
## $`75`
## [1] "to a lifetime in Hotel California And he believes that if this is as"
## 
## $`76`
## [1] "bad as it gets life is really pretty good"
## 
## $`77`
## [1] ""
## 
## $`78`
## [1] ""
## 
## $`79`
## [1] ""
txt = NULL
for (j in 1:length(ctext)) {
  txt = c(txt,ctext[[j]]$content)
}
txt = paste(txt,collapse=" ")
txt = tolower(txt)
print(txt)
## [1] "html body backgroundhttpalgoscuedusanjivdasgraphicsback2gif  sanjiv das is the william and janice terry professor of finance at santa clara universitys leavey school of business he previously held faculty appointments as associate professor at harvard business school and uc berkeley he holds postgraduate degrees in finance mphil and phd from new york university computer science ms from uc berkeley an mba from the indian institute of management ahmedabad bcom in accounting and economics university of bombay sydenham college and is also a qualified cost and works accountant he is a senior editor of the journal of investment management coeditor of the journal of derivatives and the journal of financial services research and associate editor of other academic journals prior to being an academic he worked in the derivatives business in the asiapacific region as a vicepresident at citibank his current research interests include the modeling of default risk machine learning social networks derivatives pricing models portfolio theory and venture capital he has published over eighty articles in academic journals and has won numerous awards for research and teaching his recent book derivatives principles and practice was published in may 2010  he currently also serves as a senior fellow at the fdic center for financial research   p bsanjiv das a short academic life historyb p  after loafing and working in many parts of asia but never really growing up sanjiv moved to new york to change the world hopefully through research  he graduated in 1994 with a phd from nyu and since then spent five years in boston and now lives in san jose california  sanjiv loves animals places in the world where the mountains meet the sea riding sport motorbikes reading gadgets science fiction movies and writing cool software code when there is time available from the excitement of daily life sanjiv writes academic papers which helps him relax always the contrarian sanjiv thinks that new 
york city is the most calming place in the world after california of course  p  sanjiv is now a professor of finance at santa clara university he came to scu from harvard business school and spent a year at uc berkeley in his past life in the unreal world sanjiv worked at citibank na in the asiapacific region he takes great pleasure in merging his many previous lives into his current existence which is incredibly confused and diverse  p  sanjivs research style is instilled with a distinct new york state of mind  it is chaotic diverse with minimal method to the madness he has published articles on derivatives termstructure models mutual funds the internet portfolio choice banking models credit risk and has unpublished articles in many other areas some years ago he took time off to get another degree in computer science at berkeley confirming that an unchecked hobby can quickly become an obsession there he learnt about the fascinating field of randomized algorithms skills he now applies earnestly to his editorial work and other pursuits many of which stem from being in the epicenter of silicon valley  p  coastal living did a lot to mold sanjiv who needs to live near the ocean  the many walks in greenwich village convinced him that there is no such thing as a representative investor yet added many unique features to his personal utility function he learnt that it is important to open the academic door to the ivory tower and let the world in academia is a real challenge given that he has to reconcile many more opinions than ideas he has been known to have turned down many offers from mad magazine to publish his academic work as he often explains you never really finish your education  you can check out any time you like but you can never leave which is why he is doomed to a lifetime in hotel california and he believes that if this is as bad as it gets life is really pretty good   "

Term Document Matrix (TDM)

An extremely important object in text analysis is the Term-Document Matrix (TDM). It lets us store an entire library of text in a single matrix, which may then be used for analysis as well as for searching documents. It forms the basis of search engines, topic analysis, and classification (e.g., spam filtering).

It is a table that provides the frequency count of every word (term) in each document. The number of rows in the TDM is equal to the number of unique terms, and the number of columns is equal to the number of documents.
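
To make the layout concrete, here is a minimal base-R sketch that builds a term-document matrix by hand. The three toy documents and the name `toy_tdm` are assumptions for illustration only; the tm package handles tokenization and sparse storage far more carefully.

```r
# Toy corpus: three short "documents" (illustrative, not from the talk)
docs = c("the cat sat", "the dog sat down", "a cat and a dog")

# Tokenize each document on whitespace after lowercasing
tokens = strsplit(tolower(docs), "[[:space:]]+")

# Vocabulary = sorted unique terms across all documents
terms = sort(unique(unlist(tokens)))

# Count each term in each document: rows are terms, columns are documents
toy_tdm = sapply(tokens, function(tk) table(factor(tk, levels = terms)))
colnames(toy_tdm) = seq_along(docs)

print(toy_tdm)
print(dim(toy_tdm))   # number of unique terms x number of documents
```

Each column is one document and each row one term, so a row sum gives a term's total count in the corpus.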

#TERM-DOCUMENT MATRIX
tdm = TermDocumentMatrix(ctext,control=list(minWordLength=1))
print(tdm)
## <<TermDocumentMatrix (terms: 317, documents: 79)>>
## Non-/sparse entries: 497/24546
## Sparsity           : 98%
## Maximal term length: 49
## Weighting          : term frequency (tf)
inspect(tdm[10:20,11:18])
## <<TermDocumentMatrix (terms: 11, documents: 8)>>
## Non-/sparse entries: 4/84
## Sparsity           : 95%
## Maximal term length: 12
## Weighting          : term frequency (tf)
## 
##               Docs
## Terms          11 12 13 14 15 16 17 18
##   ago           0  0  0  0  0  0  0  0
##   ahmedabad     0  0  0  0  0  0  0  0
##   algorithms    0  0  0  0  0  0  0  0
##   also          1  0  0  0  0  0  0  0
##   always        0  0  0  0  0  0  0  0
##   and           2  0  1  1  0  0  0  0
##   animals       0  0  0  0  0  0  0  0
##   another       0  0  0  0  0  0  0  0
##   any           0  0  0  0  0  0  0  0
##   applies       0  0  0  0  0  0  0  0
##   appointments  0  0  0  0  0  0  0  0
out = findFreqTerms(tdm,lowfreq=5)
print(out)
##  [1] "academic"    "and"         "derivatives" "from"        "has"        
##  [6] "his"         "many"        "research"    "sanjiv"      "that"       
## [11] "the"         "world"

Term Frequency - Inverse Document Frequency (TF-IDF)

This is a weighting scheme that sharpens the importance of rare words in a document, relative to the frequency of these words in the corpus. It is based on simple calculations and, even though it does not have strong theoretical foundations, it is very useful in practice. TF-IDF measures the importance of a word \(w\) in a document \(d\) in a corpus \(C\). It is therefore a function of all three, written TF-IDF\((w,d,C)\), and is the product of term frequency (TF) and inverse document frequency (IDF).

The frequency of a word in a document is defined as \[ f(w,d) = \frac{\#w \in d}{|d|} \] where \(|d|\) is the number of words in the document. We usually normalize word frequency so that \[ TF(w,d) = \ln[f(w,d)] \] This is log normalization. Another form of normalization is known as double normalization and is as follows: \[ TF(w,d) = \frac{1}{2} + \frac{1}{2} \frac{f(w,d)}{\max_{w \in d} f(w,d)} \] Note that normalization is not necessary, but it tends to help shrink the difference between counts of words.

Inverse document frequency is as follows: \[ IDF(w,C) = \ln\left[ \frac{|C|}{|d_{w \in d}|} \right] \] That is, we take the log of the ratio of the number of documents in the corpus \(C\) to the number of documents in the corpus that contain word \(w\).

Finally, we have the weighting score for a given word \(w\) in document \(d\) in corpus \(C\): \[ \mbox{TF-IDF}(w,d,C) = TF(w,d) \times IDF(w,C) \]

Example of TF-IDF

We illustrate this with an application to the previously computed term-document matrix.

tdm_mat = as.matrix(tdm)  #Convert tdm into a matrix
print(dim(tdm_mat))
## [1] 317  79
nw = dim(tdm_mat)[1]
nd = dim(tdm_mat)[2]
doc = 13   #Choose document
word = "derivatives"   #Choose word

#COMPUTE TF
f = NULL
for (w in row.names(tdm_mat)) {
    f = c(f,tdm_mat[w,doc]/sum(tdm_mat[,doc]))
}
fw = tdm_mat[word,doc]/sum(tdm_mat[,doc])
TF = 0.5 + 0.5*fw/max(f)
print(TF)
## [1] 0.75
#COMPUTE IDF
nw = length(which(tdm_mat[word,]>0))   #Number of documents containing the word (overwrites the earlier nw)
print(nw)
## [1] 5
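IDF = nd/nw   #Raw ratio of documents to documents containing the word; use log(nd/nw) to match the IDF formula above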
print(IDF)
## [1] 15.8
#COMPUTE TF-IDF
TF_IDF = TF*IDF
print(TF_IDF)  #With normalization
## [1] 11.85
print(fw*IDF)   #Without normalization
## [1] 1.975

We can write this code into a function and work out the TF-IDF for all words. Then these word weights may be used in further text analysis.
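
One way to do this is sketched below, using double normalization for TF and the log form of IDF from the formulas above. The function name `tfidf_all` and the toy count matrix are assumptions for illustration. Setting the weight of absent words to zero is also a choice made here, not something the formulas require.

```r
# Sketch: TF-IDF for every term (row) in every document (column) of a count matrix
tfidf_all = function(m) {
  # f(w,d): share of document d's words accounted for by word w
  f = sweep(m, 2, pmax(colSums(m), 1), "/")
  # TF with double normalization: 1/2 + 1/2 * f / max_w f, computed per document
  fmax = pmax(apply(f, 2, max), .Machine$double.eps)
  TF = 0.5 + 0.5 * sweep(f, 2, fmax, "/")
  TF[m == 0] = 0                        # a word absent from a document gets zero weight
  # IDF: log of (number of documents / number of documents containing the word)
  IDF = log(ncol(m) / rowSums(m > 0))   # assumes every term occurs in at least one document
  TF * IDF                              # row i of TF is scaled by IDF[i]
}

# Toy term-document counts: 3 terms x 3 documents
toy = matrix(c(2, 1, 0,
               1, 1, 1,
               0, 0, 3), nrow = 3, byrow = TRUE,
             dimnames = list(c("alpha", "beta", "gamma"), c("1", "2", "3")))
w = tfidf_all(toy)
print(round(w, 4))
```

A term such as "beta" that appears in every document gets an IDF of zero and so carries no weight, which is the intended effect of the scheme. With the matrix from the example above, the call would be `tfidf_all(tdm_mat)`.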

TF-IDF in the tm package

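We may also directly use the weightTfIdf function in the tm package. With its default settings, this undertakes the following computation: \[ \mbox{TF-IDF}(w,d,C) = \frac{\#w \in d}{|d|} \times \log_2\left[ \frac{|C|}{|d_{w \in d}|} \right] \] That is, term frequency is normalized by document length and the IDF uses a base-2 logarithm.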

Example:

library(tm)
textarray = c("Free software comes with ABSOLUTELY NO certain WARRANTY","You are welcome to redistribute free software under certain conditions","Natural language support for software in an English locale","A collaborative project with many contributors")
textcorpus = Corpus(VectorSource(textarray))
m = TermDocumentMatrix(textcorpus)
print(as.matrix(m))
##                Docs
## Terms           1 2 3 4
##   absolutely    1 0 0 0
##   are           0 1 0 0
##   certain       1 1 0 0
##   collaborative 0 0 0 1
##   comes         1 0 0 0
##   conditions    0 1 0 0
##   contributors  0 0 0 1
##   english       0 0 1 0
##   for           0 0 1 0
##   free          1 1 0 0
##   language      0 0 1 0
##   locale        0 0 1 0
##   many          0 0 0 1
##   natural       0 0 1 0
##   project       0 0 0 1
##   redistribute  0 1 0 0
##   software      1 1 1 0
##   support       0 0 1 0
##   under         0 1 0 0
##   warranty      1 0 0 0
##   welcome       0 1 0 0
##   with          1 0 0 1
##   you           0 1 0 0
print(as.matrix(weightTfIdf(m)))
##                Docs
## Terms                    1          2          3   4
##   absolutely    0.28571429 0.00000000 0.00000000 0.0
##   are           0.00000000 0.22222222 0.00000000 0.0
##   certain       0.14285714 0.11111111 0.00000000 0.0
##   collaborative 0.00000000 0.00000000 0.00000000 0.4
##   comes         0.28571429 0.00000000 0.00000000 0.0
##   conditions    0.00000000 0.22222222 0.00000000 0.0
##   contributors  0.00000000 0.00000000 0.00000000 0.4
##   english       0.00000000 0.00000000 0.28571429 0.0
##   for           0.00000000 0.00000000 0.28571429 0.0
##   free          0.14285714 0.11111111 0.00000000 0.0
##   language      0.00000000 0.00000000 0.28571429 0.0
##   locale        0.00000000 0.00000000 0.28571429 0.0
##   many          0.00000000 0.00000000 0.00000000 0.4
##   natural       0.00000000 0.00000000 0.28571429 0.0
##   project       0.00000000 0.00000000 0.00000000 0.4
##   redistribute  0.00000000 0.22222222 0.00000000 0.0
##   software      0.05929107 0.04611528 0.05929107 0.0
##   support       0.00000000 0.00000000 0.28571429 0.0
##   under         0.00000000 0.22222222 0.00000000 0.0
##   warranty      0.28571429 0.00000000 0.00000000 0.0
##   welcome       0.00000000 0.22222222 0.00000000 0.0
##   with          0.14285714 0.00000000 0.00000000 0.2
##   you           0.00000000 0.22222222 0.00000000 0.0
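
The entries above can be reproduced by hand, assuming weightTfIdf's default settings: term frequency is divided by the number of terms kept in the document, and IDF uses a base-2 log. The variable names below are illustrative.

```r
# Hand check of two entries in column 1 of the weighted matrix (base R only)
n_docs = 4          # documents in the toy corpus
doc1_terms = 7      # terms kept in document 1 ("NO" is dropped: shorter than 3 characters)

# "absolutely": occurs once in document 1 and appears in 1 of the 4 documents
w_absolutely = (1 / doc1_terms) * log2(n_docs / 1)

# "software": occurs once in document 1 and appears in 3 of the 4 documents
w_software = (1 / doc1_terms) * log2(n_docs / 3)

print(round(c(absolutely = w_absolutely, software = w_software), 8))
```

The common word "software" is weighted far below the rare word "absolutely", even though both occur once in the document.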

Using the ANLP package for bigrams and trigrams

This package has a few additional functions that make the preceding ideas more streamlined to implement. First let’s read in the usual text.

library(ANLP)
## Warning: package 'ANLP' was built under R version 3.2.5
## Loading required package: qdap
## Warning: package 'qdap' was built under R version 3.2.5
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## Loading required package: RColorBrewer
## 
## Attaching package: 'qdap'
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, as.TermDocumentMatrix
## The following object is masked from 'package:NLP':
## 
##     ngrams
## The following object is masked from 'package:stringr':
## 
##     %>%
## The following object is masked from 'package:base':
## 
##     Filter
## Loading required package: RWeka
## Warning: package 'RWeka' was built under R version 3.2.4
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:qdap':
## 
##     %>%
## The following object is masked from 'package:qdapTools':
## 
##     id
## The following objects are masked from 'package:qdapRegex':
## 
##     escape, explain
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Warning: replacing previous import by 'tm::TermDocumentMatrix' when loading
## 'ANLP'
download.file("http://srdas.github.io/bio-candid.html",destfile = "text")
text = readTextFile("text","UTF-8")
ctext = cleanTextData(text)  #Creates a text corpus

The last function removes non-English characters, numbers, extra white space, brackets, and punctuation. It also handles abbreviations and contractions, and converts the entire text to lower case.

We now make TDMs for unigrams, bigrams, and trigrams, and then combine them into one list for word prediction.

g1 = generateTDM(ctext,1)
g2 = generateTDM(ctext,2)
g3 = generateTDM(ctext,3)
gmodel = list(g1,g2,g3)

Next, use the back-off algorithm to predict the word that follows a given phrase.

print(predict_Backoff("you never",gmodel))
## [1] "leave"
print(predict_Backoff("life is",gmodel))
## [1] "also"
print(predict_Backoff("been known",gmodel))
## [1] "to"
print(predict_Backoff("needs to",gmodel))
## [1] "a"
print(predict_Backoff("worked at",gmodel))
## [1] "harvard"
print(predict_Backoff("being an",gmodel))
## [1] "academic"
print(predict_Backoff("publish",gmodel))
## [1] "in"
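
The idea behind back-off can be sketched in base R: look for the longest context seen in training (here, two words with a trigram table), and if it was never seen, drop the earliest context word and try again. This is an illustrative sketch on a one-sentence training text, not ANLP's implementation. The names `ngram_counts` and `predict_next` are hypothetical.

```r
# Tiny training text (illustrative)
train = "you can check out any time you like but you can never leave"
words = strsplit(train, " ")[[1]]

# Count all n-grams of length n, stored as space-separated strings
ngram_counts = function(words, n) {
  idx = seq_len(length(words) - n + 1)
  table(sapply(idx, function(i) paste(words[i:(i + n - 1)], collapse = " ")))
}
counts = list(ngram_counts(words, 1), ngram_counts(words, 2), ngram_counts(words, 3))

predict_next = function(phrase, counts) {
  ctx = tail(strsplit(tolower(phrase), " ")[[1]], 2)   # trigrams use at most two context words
  while (length(ctx) > 0) {
    prefix = paste0(paste(ctx, collapse = " "), " ")
    tab = counts[[length(ctx) + 1]]                    # n-gram table matching the context length
    hits = tab[startsWith(names(tab), prefix)]
    if (length(hits) > 0) {
      best = names(hits)[which.max(hits)]              # most frequent continuation
      return(substring(best, nchar(prefix) + 1))
    }
    ctx = ctx[-1]                                      # back off: drop the earliest context word
  }
  uni = counts[[1]]
  names(uni)[which.max(uni)]                           # nothing matched: most frequent single word
}

print(predict_next("can never", counts))    # trigram context found
print(predict_next("will never", counts))   # backs off to the bigram context "never"
print(predict_next("hotel", counts))        # backs off all the way to unigrams
```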

Wordclouds

Wordclouds are an interesting way to represent text: they give an instant visual summary. The wordcloud package in R may be used to create your own wordclouds.

#MAKE A WORDCLOUD
library(wordcloud)
tdm2 = as.matrix(tdm)
wordcount = sort(rowSums(tdm2),decreasing=TRUE)
tdm_names = names(wordcount)
wordcloud(tdm_names,wordcount)

#REMOVE STOPWORDS AND NUMBERS
ctext1 = tm_map(ctext,removeWords,stopwords("english"))
ctext1 = tm_map(ctext1, removeNumbers)
tdm = TermDocumentMatrix(ctext1,control=list(minWordLength=1))
tdm2 = as.matrix(tdm)
wordcount = sort(rowSums(tdm2),decreasing=TRUE)
tdm_names = names(wordcount)
wordcloud(tdm_names,wordcount)

Stemming

Stemming is the procedure by which a word is reduced to its root or stem. This is done so that words from the same stem are treated as one word rather than as separate words. For example, we do not want “eaten” and “eating” to be treated as different words.

#STEMMING
ctext2 = tm_map(ctext,removeWords,stopwords("english"))
ctext2 = tm_map(ctext2, stemDocument)
print(lapply(ctext2, as.character))
## $`1`
##  [1] ""                                                         
##  [2] ""                                                         
##  [3] ""                                                         
##  [4] "sanjiv das   william  janic terri professor  financ"      
##  [5] "santa clara univers leavey school  busi  previous held"   
##  [6] "faculti appoint  associ professor  harvard busi school"   
##  [7] " uc berkeley  hold postgradu degre  financ mphil"         
##  [8] "phd  new york univers comput scienc ms  uc"               
##  [9] "berkeley  mba   indian institut  manag ahmedabad"         
## [10] "bcom  account  econom univers  bombay sydenham"           
## [11] "colleg   also  qualifi cost  work account  "              
## [12] "senior editor   journal  invest manag coeditor"           
## [13] " journal  deriv   journal  financi servic"                
## [14] "research  associ editor   academ journal prior"           
## [15] "  academ  work   deriv busi "                             
## [16] "asiapacif region   vicepresid  citibank  current"         
## [17] "research interest includ  model  default risk machin"     
## [18] "learn social network deriv price model portfolio"         
## [19] "theori  ventur capit   publish  eighti articl"            
## [20] "academ journal   won numer award  research"               
## [21] "teach  recent book deriv principl  practic"               
## [22] "publish  may  current also serv   senior fellow"          
## [23] " fdic center  financi research"                           
## [24] ""                                                         
## [25] ""                                                         
## [26] "sanjiv das  short academ life histori"                    
## [27] ""                                                         
## [28] " loaf  work  mani part  asia  never realli"               
## [29] "grow  sanjiv move  new york  chang  world hope"           
## [30] " research  graduat    phd  nyu"                           
## [31] "sinc  spent five year  boston  now live  san jose"        
## [32] "california sanjiv love anim place   world "               
## [33] "mountain meet  sea ride sport motorbik read gadget"       
## [34] "scienc fiction movi  write cool softwar code  "           
## [35] "time avail   excit  daili life sanjiv write"              
## [36] "academ paper  help  relax alway  contrarian sanjiv"       
## [37] "think  new york citi    calm place   world"               
## [38] " california  cours"                                       
## [39] ""                                                         
## [40] ""                                                         
## [41] ""                                                         
## [42] "sanjiv  now  professor  financ  santa clara univers  came"
## [43] " scu  harvard busi school  spent  year  uc berkeley"      
## [44] " past life   unreal world sanjiv work  citibank na"       
## [45] " asiapacif region  take great pleasur  merg  mani"        
## [46] "previous live   current exist   incred confus"            
## [47] " divers"                                                  
## [48] ""                                                         
## [49] ""                                                         
## [50] ""                                                         
## [51] "sanjiv research style  instil   distinct new york state"  
## [52] "mind   chaotic divers  minim method   mad"                
## [53] " publish articl  deriv termstructur model mutual"         
## [54] "fund  internet portfolio choic bank model credit risk"    
## [55] " unpublish articl  mani  area  year ago  took"            
## [56] "time   get anoth degre  comput scienc  berkeley"          
## [57] "confirm   uncheck hobbi can quick becom  obsess"          
## [58] "  learnt   fascin field  random algorithm"                
## [59] "skill  now appli earnest   editori work "                 
## [60] "pursuit mani   stem     epicent  silicon"                 
## [61] "valley"                                                   
## [62] ""                                                         
## [63] ""                                                         
## [64] ""                                                         
## [65] "coastal live   lot  mold sanjiv  need  live near"         
## [66] "ocean  mani walk  greenwich villag convinc   "            
## [67] "  thing   repres investor yet ad mani uniqu"              
## [68] "featur   person util function  learnt  "                  
## [69] "import  open  academ door   ivori tower  let  world"      
## [70] " academia   real challeng given     reconcil mani"        
## [71] " opinion  idea    known   turn  mani"                     
## [72] "offer  mad magazin  publish  academ work   often"         
## [73] "explain  never realli finish  educ  can check"            
## [74] " time  like   can never leav      doom"                   
## [75] "  lifetim  hotel california   believ    "                 
## [76] "bad   get life  realli pretti good"                       
## [77] ""                                                         
## [78] ""                                                         
## [79] ""

Regular Expressions

Regular expressions are a pattern syntax used to search, match, and modify strings efficiently. They are complicated but extremely effective. Here we illustrate with a few examples, but you are encouraged to explore more on your own, as the variations are endless. What you need will depend on the application at hand, and with some experience you will become better at using regular expressions. The initial use will, however, be somewhat confusing.

We start with a simple example of a text array where we wish to replace the string “data” with a blank, i.e., we eliminate this string from the text we have.

library(tm)
#Create a text array
text = c("Doc1 is datavision","Doc2 is datatable","Doc3 is data","Doc4 is nodata","Doc5 is simpler")
print(text)
## [1] "Doc1 is datavision" "Doc2 is datatable"  "Doc3 is data"      
## [4] "Doc4 is nodata"     "Doc5 is simpler"
#Remove every occurrence of the chosen text from all docs
print(gsub("data","",text))
## [1] "Doc1 is vision"  "Doc2 is table"   "Doc3 is "        "Doc4 is no"     
## [5] "Doc5 is simpler"
#Remove everything from the first "data" to the end of the string (".*" matches the rest)
print(gsub("data.*","",text))
## [1] "Doc1 is "        "Doc2 is "        "Doc3 is "        "Doc4 is no"     
## [5] "Doc5 is simpler"
#Remove "data" together with the single character just before it ("." matches any one character)
print(gsub(".data","",text))
## [1] "Doc1 isvision"   "Doc2 istable"    "Doc3 is"         "Doc4 is n"      
## [5] "Doc5 is simpler"
#Remove everything from the character just before "data" to the end of the string
print(gsub(".data.*","",text))
## [1] "Doc1 is"         "Doc2 is"         "Doc3 is"         "Doc4 is n"      
## [5] "Doc5 is simpler"
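
The patterns above match "data" anywhere in the string, including inside a longer word. To act on whole words only, one option is to anchor the pattern at word boundaries. The sketch below assumes Perl-compatible regular expressions (`perl = TRUE`), where `\\b` is a word boundary and `\\w*` matches the rest of a word.

```r
text = c("Doc1 is datavision","Doc2 is datatable","Doc3 is data","Doc4 is nodata","Doc5 is simpler")

# Remove whole words that START with "data" ("nodata" is left alone)
starts = gsub("\\bdata\\w*", "", text, perl = TRUE)
print(starts)

# Remove whole words that END with "data" ("datavision" and "datatable" are left alone)
ends = gsub("\\w*data\\b", "", text, perl = TRUE)
print(ends)
```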

Complex Regular Expressions using grep

We now explore some more complex regular expressions. A common case is searching for special types of strings such as telephone numbers. Suppose we have a text array that may contain telephone numbers in different formats. We can use a single grep command to find these numbers. Here is some code to illustrate this.

#Create an array with some strings which may also contain telephone numbers as strings. 
x = c("234-5678","234 5678","2345678","1234567890","0123456789","abc 234-5678","234 5678 def","xx 2345678","abc1234567890def")

#Now use grep to find which elements of the array contain telephone numbers
idx = grep("[[:digit:]]{3}-[[:digit:]]{4}|[[:digit:]]{3} [[:digit:]]{4}|[1-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]",x)
print(idx)
## [1] 1 2 4 6 7 9
print(x[idx])
## [1] "234-5678"         "234 5678"         "1234567890"      
## [4] "abc 234-5678"     "234 5678 def"     "abc1234567890def"
#We can shorten this as follows
idx = grep("[[:digit:]]{3}-[[:digit:]]{4}|[[:digit:]]{3} [[:digit:]]{4}|[1-9][0-9]{9}",x)
print(idx)
## [1] 1 2 4 6 7 9
print(x[idx])
## [1] "234-5678"         "234 5678"         "1234567890"      
## [4] "abc 234-5678"     "234 5678 def"     "abc1234567890def"
#What if we want to extract only the phone number and drop the rest of the text?
pattern = "[[:digit:]]{3}-[[:digit:]]{4}|[[:digit:]]{3} [[:digit:]]{4}|[1-9][0-9]{9}"
print(regmatches(x, gregexpr(pattern,x)))
## [[1]]
## [1] "234-5678"
## 
## [[2]]
## [1] "234 5678"
## 
## [[3]]
## character(0)
## 
## [[4]]
## [1] "1234567890"
## 
## [[5]]
## character(0)
## 
## [[6]]
## [1] "234-5678"
## 
## [[7]]
## [1] "234 5678"
## 
## [[8]]
## character(0)
## 
## [[9]]
## [1] "1234567890"
#Or use the stringr package, which is a lot better
library(stringr)
str_extract(x,pattern)
## [1] "234-5678"   "234 5678"   NA           "1234567890" NA          
## [6] "234-5678"   "234 5678"   NA           "1234567890"
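
To see which of the three alternatives in the pattern is responsible for each match, and why elements 3, 5, and 8 are skipped, the alternatives can be tested one at a time. This is a small diagnostic sketch. The names `dash`, `space`, `ten`, and `formats` are illustrative.

```r
x = c("234-5678","234 5678","2345678","1234567890","0123456789","abc 234-5678","234 5678 def","xx 2345678","abc1234567890def")

# One logical flag per alternative in the pattern
dash  = grepl("[[:digit:]]{3}-[[:digit:]]{4}", x)   # 3 digits, dash, 4 digits
space = grepl("[[:digit:]]{3} [[:digit:]]{4}", x)   # 3 digits, space, 4 digits
ten   = grepl("[1-9][0-9]{9}", x)                   # 10 digits in a row, the first not 0

formats = data.frame(x, dash, space, ten)
print(formats)
print(which(dash | space | ten))   # elements matched by at least one alternative
```

Elements 3 and 8 hold seven digits with no separator, and element 5 has no run of ten digits that starts with a non-zero digit, so none of the alternatives applies to them.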

Using grep for emails

Now we use grep to extract emails by looking for the “@” sign in the text string. We would proceed as in the following example.

x = c("sanjiv das","srdas@scu.edu","SCU","data@science.edu")
print(grep("\\@",x))
## [1] 2 4
print(x[grep("\\@",x)])
## [1] "srdas@scu.edu"    "data@science.edu"
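
Looking for "@" tells us which elements contain an email address, but not where the address starts and ends inside a longer string. A hedged sketch of extracting the addresses themselves follows. The pattern is a deliberately simplified assumption (it is not a full implementation of the email address standard), and the fourth test string is made up for illustration.

```r
x = c("sanjiv das","srdas@scu.edu","SCU","write to data@science.edu or srdas@scu.edu today")

# local part, "@", domain, then a dot and a top-level domain of 2+ letters
pattern = "[[:alnum:]._-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"

# gregexpr finds every match in each element; regmatches pulls out the matched text
emails = regmatches(x, gregexpr(pattern, x))
print(unlist(emails))
```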

You get the idea. Using the functions gsub, grep, regmatches, and gregexpr, you can manage most fancy string handling that is needed.

Extracting Text from the Web using APIs

We now turn to getting text from the web using the APIs of services such as Twitter and Facebook. You will need to open a free developer account on each site, and you will also need the R package specific to each source.

Twitter

The Twitter API needs a lot of handshaking…

##TWITTER EXTRACTOR
library(twitteR)
library(ROAuth)
library(RCurl)
download.file(url="https://curl.haxx.se/ca/cacert.pem",destfile="cacert.pem")
#certificate file based on Privacy Enhanced Mail (PEM) protocol: https://en.wikipedia.org/wiki/Privacy-enhanced_Electronic_Mail

cKey = "h4J3x0i5kgD58E1t5JCEnw"  #These are my keys and won't work for you
cSecret = "fi4SOHENNySeQKWe95SuBIRx74Xjv0Cx4EZx59QKwg"   #use your own secret
reqURL = "https://api.twitter.com/oauth/request_token"
accURL = "https://api.twitter.com/oauth/access_token"
authURL = "https://api.twitter.com/oauth/authorize"

#NOW SUBMIT YOUR CODES AND ASK FOR CREDENTIALS
cred = OAuthFactory$new(consumerKey=cKey, consumerSecret=cSecret,requestURL=reqURL, accessURL=accURL,authURL=authURL)
cred$handshake(cainfo="cacert.pem") #Asks for token

#Test and save credentials
#registerTwitterOAuth(cred)
#save(list="cred",file="twitteR_credentials")
#FIRST PHASE DONE

Accessing Twitter

##USE httr, SECOND PHASE
library(httr)
#options(httr_oauth_cache=T)
accToken = "18666236-DmDE1wwbpvPbDcw9kwt9yThGeyYhjfpVVywrHuhOQ"
accTokenSecret = "cttbpxpTtqJn7wrCP36I59omNI5GQHXXgV41sKwUgc"
setup_twitter_oauth(cKey,cSecret,accToken,accTokenSecret)  #At prompt type 1

This completes the handshaking with Twitter. Now we can access tweets using the functions in the twitteR package.

Using the twitteR package

#EXAMPLE 1
s = searchTwitter("#GOOG")  #This is a list
s

#CONVERT TWITTER LIST TO TEXT ARRAY (see documentation in twitteR package)
twts = twListToDF(s)  #This gives a dataframe with the tweets
names(twts)

twts_array = twts$text
print(twts$retweetCount)
twts_array

#EXAMPLE 2
s = getUser("srdas")
fr = s$getFriends()
print(length(fr))
print(fr[1:10])
s_tweets = userTimeline("srdas",n=20)
print(s_tweets)

getCurRateLimitInfo(c("users","statuses"))  #Argument is a vector of API resource families, not a user name

Getting Streaming Data from Twitter

This assumes you have a working Twitter account and have already connected R to it using the twitteR package.

library(streamR)
filterStream(file.name = "tweets.json", # Save tweets in a json file
             track = "useR_Stanford" , # Collect tweets mentioning useR_Stanford. Can use twitter handles or keywords.
             language = "en",
             timeout = 60, # Keep connection alive for 60 seconds
             oauth = cred) # Use OAuth credentials

tweets.df <- parseTweets("tweets.json", simplify = FALSE) # parse the json file and save to a data frame called tweets.df. Simplify = FALSE ensures that we include lat/lon information in that data frame.

Retrieving tweets of a particular user over a 60 second time period

filterStream(file.name = "tweets.json", # Save tweets in a json file
             follow = "3497513953" , # Collect tweets from the useR2016 feed. Must use the Twitter ID of the user.
             language = "en",
             timeout = 60, # Keep connection alive for 60 seconds
             oauth = cred) # Use the OAuth credentials created earlier
tweets.df <- parseTweets("tweets.json", simplify = FALSE)

Streaming messages from the accounts your user follows.

userStream( file.name="my_timeline.json", with="followings",tweets=10, oauth=cred )

Facebook

Now we move on to using Facebook, which is a little less trouble than Twitter. Also the results may be used for creating interesting networks.

##FACEBOOK EXTRACTOR
library(Rfacebook)
library(SnowballC)
library(Rook)
library(ROAuth)
app_id = "847737771920076"   # USE YOUR OWN IDs
app_secret = "a120a2ec908d9e00fcd3c619cad7d043"
fb_oauth = fbOAuth(app_id,app_secret,extended_permissions=TRUE)
#save(fb_oauth,file="fb_oauth")

#DIRECT LOAD
load("fb_oauth")

Examples

##EXAMPLES
bbn = getUsers("bloombergnews",token=fb_oauth)
print(bbn)

page = getPage(page="bloombergnews",token=fb_oauth,n=20)
print(dim(page))

print(head(page))

print(names(page))

print(page$message)

print(page$message[11])

Yelp - Setting up an authorization

First we examine the protocol for connecting to the Yelp API. This assumes you have opened a developer account with Yelp.

###CODE to connect to YELP.
consumerKey = "z6w-Or6HSyKbdUTmV9lbOA"
consumerSecret = "ImUufP3yU9FmNWWx54NUbNEBcj8"
token = "mBzEBjhYIGgJZnmtTHLVdQ-0cyfFVRGu"
token_secret = "v0FGCL0TS_dFDWFwH3HptDZhiLE"

Yelp - handshaking with the API

require(httr)
require(httpuv)
require(jsonlite)
# authorization
myapp = oauth_app("YELP", key=consumerKey, secret=consumerSecret)
sig=sign_oauth1.0(myapp, token=token,token_secret=token_secret)
## Searching the top ten bars in Chicago and SF.
limit <- 10

# 10 bars in Chicago
yelpurl <- paste0("http://api.yelp.com/v2/search/?limit=",limit,"&location=Chicago%20IL&term=bar")
# or 10 bars by geo-coordinates
yelpurl <- paste0("http://api.yelp.com/v2/search/?limit=",limit,"&ll=37.788022,-122.399797&term=bar")

locationdata=GET(yelpurl, sig)
locationdataContent = content(locationdata)
locationdataList=jsonlite::fromJSON(toJSON(locationdataContent))
head(data.frame(locationdataList))

for (j in 1:limit) {
  print(locationdataContent$businesses[[j]]$snippet_text)
}
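The loop above assumes that exactly `limit` businesses come back and that each one carries a snippet. A more defensive way to collect the fields is sketched below. The list `mock_content` is a made-up stand-in for `content(locationdata)` and `get_field` is a hypothetical helper. Only the field name `snippet_text` is taken from the code above, and `name` is assumed.

```r
# Mock of the parsed response: a stand-in for content(locationdata).
# The values are made up; "name" is an assumed field.
mock_content = list(businesses = list(
    list(name = "Bar One", snippet_text = "Great cocktails and friendly staff."),
    list(name = "Bar Two", snippet_text = "Loud, but the beer list is long."),
    list(name = "Bar Three")   # no snippet returned for this business
))

# Hypothetical helper: return NA when a field is missing
get_field = function(b, field) {
    if (is.null(b[[field]])) NA_character_ else as.character(b[[field]])
}

# Iterate over whatever was returned instead of over 1:limit
biz = mock_content$businesses
snippets = data.frame(
    name = vapply(biz, get_field, character(1), field = "name"),
    snippet = vapply(biz, get_field, character(1), field = "snippet_text"),
    stringsAsFactors = FALSE
)
print(snippets)
```

Iterating over the returned list avoids a subscript error when the API returns fewer results than requested.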

Cosine Similarity in the Text Domain

In this segment we will learn some popular functions on text that are used in practice. One of the first things we like to do is to find similar text or like sentences (think of web search as one application). Since documents are vectors in the TDM, we may want to find the closest vectors or compute the distance between vectors.

\[ cos(\theta) = \frac{A \cdot B}{||A|| \times ||B||} \]

where \(||A|| = \sqrt{A \cdot A}\) is the norm of \(A\), that is, the square root of the dot product of \(A\) with itself. This gives the cosine of the angle between the two vectors: it is zero for orthogonal vectors and 1 for vectors that point in the same direction (identical vectors being a special case).

#COSINE DISTANCE OR SIMILARITY
A = as.matrix(c(0,3,4,1,7,0,1))
B = as.matrix(c(0,4,3,0,6,1,1))
cos = t(A) %*% B / (sqrt(t(A)%*%A) * sqrt(t(B)%*%B))
print(cos)
##           [,1]
## [1,] 0.9682728
library(lsa)
## Loading required package: SnowballC
## 
## Attaching package: 'lsa'
## The following object is masked from 'package:dplyr':
## 
##     query
#THE COSINE FUNCTION IN LSA ONLY TAKES ARRAYS
A = c(0,3,4,1,7,0,1)
B = c(0,4,3,0,6,1,1)
print(cosine(A,B))
##           [,1]
## [1,] 0.9682728
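
The same calculation extends from a pair of vectors to every pair of documents in a term-document matrix. The sketch below uses base R only. The matrix `tdm` holds made-up counts and `cosine_matrix` is a hypothetical helper name, not a function from any package.

```r
# Toy term-document matrix: rows are terms, columns are documents
# (counts are made up for illustration)
tdm = matrix(c(2, 0, 1,
               1, 1, 0,
               0, 3, 1,
               0, 1, 2),
             nrow = 4, byrow = TRUE,
             dimnames = list(c("bond", "stock", "risk", "yield"),
                             c("doc1", "doc2", "doc3")))

# Pairwise cosine similarity between all columns of m
cosine_matrix = function(m) {
    norms = sqrt(colSums(m^2))          # Euclidean norm of each document
    crossprod(m) / outer(norms, norms)  # dot products divided by norm products
}

sims = cosine_matrix(tdm)
print(round(sims, 3))

# Find the document closest to doc1, excluding doc1 itself
s = sims["doc1", ]
s["doc1"] = -Inf
closest = names(which.max(s))
print(closest)
```

The diagonal of the result is 1 by construction, and the row for a given document ranks all other documents by similarity, which is the basic operation behind a search for like text.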

Dictionaries - I

  1. Webster’s defines a “dictionary” as “…a reference source in print or electronic form containing words usually alphabetically arranged along with information about their forms, pronunciations, functions, etymologies, meanings, and syntactical and idiomatic uses.”

  2. The Harvard General Inquirer: http://www.wjh.harvard.edu/~inquirer/

  3. Standard Dictionaries: www.dictionary.com, and www.merriam-webster.com.

  4. Computer dictionary: http://www.hyperdictionary.com/computer that contains about 14,000 computer related words, such as “byte” or “hyperlink”.

  5. Math dictionary, such as http://www.amathsdictionaryforkids.com/dictionary.html.

  6. Medical dictionary, see http://www.hyperdictionary.com/medical.

Dictionaries - II

  1. Internet lingo dictionaries may be used to complement standard dictionaries with words that are not usually found in standard language, for example, see http://www.netlingo.com/dictionary/all.php for words such as “2BZ4UQT” which stands for “too busy for you cutey” (LOL). When extracting text messages, postings on Facebook, or stock message board discussions, internet lingo does need to be parsed and such a dictionary is very useful.

  2. Associative dictionaries are also useful when trying to find context, as the word may be related to a concept, identified using a dictionary such as http://www.visuwords.com/. This dictionary doubles up as a thesaurus, as it provides alternative words and phrases that mean the same thing, and also related concepts.

  3. Value dictionaries deal with values and may be useful when affect alone (positive or negative) is insufficient for scoring text. The Lasswell Value Dictionary http://www.wjh.harvard.edu/~inquirer/lasswell.htm may be used to score the loading of text on the eight basic value categories: Wealth, Power, Respect, Rectitude, Skill, Enlightenment, Affection, and Well being.

Lexicons

  1. A lexicon is defined by Webster’s as “a book containing an alphabetical arrangement of the words in a language and their definitions; the vocabulary of a language, an individual speaker or group of speakers, or a subject; the total stock of morphemes in a language.” This suggests it is not that different from a dictionary.

  2. A “morpheme” is defined as “a word or a part of a word that has a meaning and that contains no smaller part that has a meaning.”

  3. In the text analytics realm, we will take a lexicon to be a smaller, special purpose dictionary, containing words that are relevant to the domain of interest.

  4. The benefit of a lexicon is that it enables focusing only on words that are relevant to the analytics and discards words that are not.

  5. Another benefit is that since it is a smaller dictionary, the computational effort required by text analytics algorithms is drastically reduced.

Constructing a lexicon

  1. By hand. This is an effective technique and the simplest. It calls for a human reader who scans a representative sample of text documents and culls important words that lend interpretive meaning.

  2. Examine the term document matrix for most frequent words, and pick the ones that have high connotation for the classification task at hand.

  3. Use pre-classified documents in a text corpus. We analyze the separate groups of documents to find words whose difference in frequency between groups is highest. Such words are likely to be better in discriminating between groups.
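
The third approach can be sketched in a few lines of base R. This is one simple way to do it, under stated assumptions: the two groups of sentences are made up, `word_freq` is a hypothetical helper, and the gap in relative frequency is only one of several possible discrimination measures.

```r
# Toy pre-classified corpus (made-up sentences)
pos_docs = c("strong earnings and strong growth",
             "profit beat estimates, strong outlook")
neg_docs = c("weak earnings and heavy losses",
             "losses widen as growth stalls")

# Relative frequency of each word within a group of documents
word_freq = function(docs) {
    words = unlist(strsplit(tolower(gsub("[[:punct:]]", "", docs)), " +"))
    tab = table(words)
    setNames(as.numeric(tab) / sum(tab), names(tab))
}

fpos = word_freq(pos_docs)
fneg = word_freq(neg_docs)

# Align the two frequency vectors on a common vocabulary (missing words get 0)
vocab = union(names(fpos), names(fneg))
p = fpos[vocab]; p[is.na(p)] = 0
n = fneg[vocab]; n[is.na(n)] = 0
gap = setNames(as.numeric(p) - as.numeric(n), vocab)

# Words with the largest frequency gaps are candidate lexicon entries
print(head(sort(gap, decreasing = TRUE), 3))   # lean positive
print(head(sort(gap), 3))                      # lean negative
```

Words such as "earnings" that occur equally often in both groups get a gap of zero and would be left out of the lexicon.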

Lexicons as Word Lists

  1. Das and Chen (2007) constructed a lexicon of about 375 words that are useful in parsing sentiment from stock message boards. This lexicon also introduced the notion of “negation tagging” into the literature.

  2. Loughran and McDonald (2011):

Scoring Text

Mood Scoring using Harvard Inquirer

Creating Positive and Negative Word Lists

#MOOD SCORING USING HARVARD INQUIRER
#Read in the Harvard Inquirer Dictionary
#And create a list of positive and negative words
HIDict = readLines("inqdict.txt")
dict_pos = HIDict[grep("Pos",HIDict)]
poswords = NULL
for (s in dict_pos) {
    s = strsplit(s,"#")[[1]][1]
    poswords = c(poswords,strsplit(s," ")[[1]][1])
}
dict_neg = HIDict[grep("Neg",HIDict)]
negwords = NULL
for (s in dict_neg) {
    s = strsplit(s,"#")[[1]][1]
    negwords = c(negwords,strsplit(s," ")[[1]][1])
}
poswords = tolower(poswords)
negwords = tolower(negwords)
print(sample(poswords,25))
##  [1] "casual"        "unimpeachable" "inventor"      "zest"         
##  [5] "promise"       "like"          "justifiably"   "gain"         
##  [9] "fairness"      "redemption"    "shelter"       "comical"      
## [13] "ally"          "advisable"     "therapeutic"   "merry"        
## [17] "improve"       "calm"          "distinct"      "honorable"    
## [21] "tradition"     "indescribable" "enlightenment" "exult"        
## [25] "excel"
print(sample(negwords,25))
##  [1] "recoil"        "incredibility" "treasonous"    "stalemate"    
##  [5] "vex"           "rejection"     "gamble"        "indignation"  
##  [9] "criminal"      "discomfort"    "fraudulent"    "muddy"        
## [13] "complex"       "hypocrite"     "nebulous"      "infuriate"    
## [17] "pollution"     "mishap"        "bribe"         "maladjustment"
## [21] "rid"           "subside"       "nobody"        "segregation"  
## [25] "incompetent"
poswords = unique(poswords)
negwords = unique(negwords)
print(length(poswords))
## [1] 1647
print(length(negwords))
## [1] 2121

The preceding code created two arrays, one of positive words and another of negative words.
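
The two loops can also be written without explicit iteration. The sketch below is an equivalent formulation, not the code used in the talk. The lines in `toy_dict` are hypothetical and only imitate the layout the loops assume (a word, optionally followed by "#" and a sense number, then space-separated tags), and `first_word` is a made-up helper name.

```r
# Hypothetical lines imitating the layout assumed by the loops above
toy_dict = c("ABLE H4Lvd Pos Pstv Strong Virtue",
             "ABANDON H4Lvd Neg Ngtv Weak Fail",
             "ACCEPT#1 H4Lvd Pos Pstv Submit",
             "ACCEPT#2 H4Lvd Pos Pstv Submit",
             "ABOUT#1 H4Lvd PREP")

# Keep everything before the first "#" or space, lower-case, de-duplicate
first_word = function(lines) unique(tolower(sub("[# ].*$", "", lines)))

toy_pos = first_word(toy_dict[grep("Pos", toy_dict)])
toy_neg = first_word(toy_dict[grep("Neg", toy_dict)])
print(toy_pos)
print(toy_neg)
```

Because sub is vectorized over lines, this avoids growing a vector inside a loop, which matters little here but helps with larger dictionaries.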

One Function to Rule All Text

In order to score text, we need to clean it first and put it into an array to compare with the word list of positive and negative words. I wrote a general purpose function that grabs text and cleans it up for further use.

library(tm)
library(stringr)
#READ IN TEXT FOR ANALYSIS, PUT IT IN A CORPUS, OR ARRAY, OR FLAT STRING
#cstem=1, if stemming needed
#cstop=1, if stopwords to be removed
#ccase=1 for lower case, ccase=2 for upper case
#cpunc=1, if punctuation to be removed
#cflat=1 for flat text wanted, cflat=2 if text array, else returns corpus
read_web_page = function(url,cstem=0,cstop=0,ccase=0,cpunc=0,cflat=0) {
    text = readLines(url)
    text = text[setdiff(seq(1,length(text)),grep("<",text))]
    text = text[setdiff(seq(1,length(text)),grep(">",text))]
    text = text[setdiff(seq(1,length(text)),grep("]",text))]
    text = text[setdiff(seq(1,length(text)),grep("}",text))]
    text = text[setdiff(seq(1,length(text)),grep("_",text))]
    text = text[setdiff(seq(1,length(text)),grep("\\/",text))]
    ctext = Corpus(VectorSource(text))
    if (cstem==1) { ctext = tm_map(ctext, stemDocument) }
    if (cstop==1) { ctext = tm_map(ctext, removeWords, stopwords("english"))}
    if (cpunc==1) { ctext = tm_map(ctext, removePunctuation) }
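    #Wrap base functions in content_transformer so the result remains a valid corpus
    if (ccase==1) { ctext = tm_map(ctext, content_transformer(tolower)) }
    if (ccase==2) { ctext = tm_map(ctext, content_transformer(toupper)) }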
    text = ctext
    #CONVERT FROM CORPUS IF NEEDED
    if (cflat>0) {
        text = NULL
        for (j in 1:length(ctext)) {
            temp = ctext[[j]]$content
            if (temp!="") { text = c(text,temp) }
        }
        text = as.array(text)
    }
    if (cflat==1) {
        text = paste(text,collapse="\n")
        text = str_replace_all(text, "[\r\n]" , " ")
    }
    return(text)
}
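
The first step of the function drops whole lines rather than stripping tags, which is worth seeing in isolation. The sketch below reproduces that filter on made-up lines standing in for `readLines(url)`, with the six separate grep calls combined into one character class (a rewrite for illustration, assumed equivalent to the chain above).

```r
# Made-up lines standing in for readLines(url)
raw = c("<html><body>",
        "Sanjiv Das is a Professor of Finance",
        "<p>Some marked-up paragraph</p>",
        "var x = {a: 1};",
        "He lives in San Jose",
        "see http://example.com/page")

# Drop every line containing any of the characters < > ] } _ /
# ("]" is placed first inside the bracket expression so that it is literal)
keep = !grepl("[]<>}_/]", raw)
kept_lines = raw[keep]
print(kept_lines)
```

Note that the third line is discarded even though it contains readable text, so this approach works best on pages where prose and markup sit on separate lines.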

Example

Now apply this function and see how we can get some clean text.

url = "http://srdas.github.io/research.htm"
res = read_web_page(url,0,0,0,1,1)
print(res)
## [1] "Data Science Theories Models Algorithms and Analytics web book  work in progress Derivatives Principles and Practice 2010 Rangarajan Sundaram and Sanjiv Das McGraw Hill An IndexBased Measure of Liquidity with George Chacko and Rong Fan 2016 Matrix Metrics NetworkBased Systemic Risk Scoring 2016 of systemic risk This paper won the First Prize in the MITCFP competition 2016 for  the best paper on SIFIs systemically important financial institutions  It also won the best paper award at  Credit Spreads with Dynamic Debt with Seoyoung Kim 2015  Text and Context Language Analytics for Finance 2014 Strategic Loan Modification An OptionsBased Response to Strategic Default Options and Structured Products in Behavioral Portfolios with Meir Statman 2013  and barrier range notes in the presence of fattailed outcomes using copulas Polishing Diamonds in the Rough The Sources of Syndicated Venture Performance 2011 with Hoje Jo and Yongtae Kim  Optimization with Mental Accounts 2010 with Harry Markowitz Jonathan Accountingbased versus marketbased crosssectional models of CDS spreads  with Paul Hanouna and Atulya Sarin 2009  Hedging Credit Equity Liquidity Matters with Paul Hanouna 2009 An Integrated Model for Hybrid Securities Yahoo for Amazon Sentiment Extraction from Small Talk on the Web Common Failings How Corporate Defaults are Correlated  with Darrell Duffie Nikunj Kapadia and Leandro Saita A Clinical Study of Investor Discussion and Sentiment  with Asis MartinezJerez and Peter Tufano 2005  International Portfolio Choice with Systemic Risk The loss resulting from diminished diversification is small while Speech Signaling Risksharing and the Impact of Fee Structures on investor welfare Contrary to regulatory intuition incentive structures A DiscreteTime Approach to Noarbitrage Pricing of Credit derivatives with Rating Transitions with Viral Acharya and Rangarajan Sundaram Pricing Interest Rate Derivatives A General Approachwith George Chacko A DiscreteTime Approach to 
ArbitrageFree Pricing of Credit Derivatives  The Psychology of Financial Decision Making A Case for TheoryDriven Experimental Enquiry 1999 with Priya Raghubir Of Smiles and Smirks A Term Structure Perspective A Theory of Banking Structure 1999 with Ashish Nanda by function based upon two dimensions the degree of information asymmetry  A Theory of Optimal Timing and Selectivity  A Direct DiscreteTime Approach to PoissonGaussian Bond Option Pricing in the HeathJarrowMorton  The Central Tendency A Second Factor in Bond Yields 1998 with Silverio Foresi and Pierluigi Balduzzi   Efficiency with Costly Information A Reinterpretation of Evidence from Managed Portfolios with Edwin Elton Martin Gruber and Matt  Presented and Reprinted in the Proceedings of The  Seminar on the Analysis of Security Prices at the Center  for Research in Security   Prices  at the University of  Coming up Short Managing Underfunded Portfolios in an LDIES Framework 2014  with Seoyoung Kim and Meir Statman   Going for Broke Restructuring Distressed Debt Portfolios 2014 Digital Portfolios 2013  Options on Portfolios with HigherOrder Moments 2009 options on a multivariate system of assets calibrated to the return  Dealing with Dimension Option Pricing on Factor Trees 2009 you to price options on multiple assets in a unified fraamework Computational Modeling Correlated Default with a Forest of Binomial Trees 2007 with Basel II Correlation Related Issues 2007  Correlated Default Risk 2006 with Laurence Freed Gary Geng and Nikunj Kapadia increase as markets worsen Regime switching models are needed to explain dynamic A Simple Model for Pricing Equity Options with Markov Switching State Variables 2006 with Donald Aingworth and Rajeev Motwani The Firms Management of Social Interactions 2005 with D Godes D Mayzlin Y Chen S Das C Dellarocas  B Pfeieffer B Libai S Sen M Shi and P Verlegh  Financial Communities with Jacob Sisk 2005  Summer 112123 Monte Carlo Markov Chain Methods for Derivative Pricing and 
Risk Assessmentwith Alistair Sinclair 2005  where incomplete information about the value of an asset may be exploited to  undertake fast and accurate pricing Proof that a fully polynomial randomized  Correlated Default Processes A CriterionBased Copula Approach Special Issue on Default Risk  Private Equity Returns An Empirical Examination of the Exit of VentureBacked Companies with Murali Jagannathan and Atulya Sarin firm being financed the valuation at the time of financing and the prevailing market sentiment Helps understand the risk premium required for the Issue on Computational Methods in Economics and Finance   December 5569 Bayesian Migration in Credit Ratings Based on Probabilities of The Impact of Correlated Default Risk on Credit Portfolios with Gifford Fong and Gary Geng How Diversified are Internationally Diversified Portfolios TimeVariation in the Covariances between International Returns DiscreteTime Bond and Option Pricing for JumpDiffusion Macroeconomic Implications of Search Theory for the Labor Market Auction Theory A Summary with Applications and Evidence from the Treasury Markets 1996 with Rangarajan Sundaram A Simple Approach to Three Factor Affine Models of the Term Structure with Pierluigi Balduzzi Silverio Foresi and Rangarajan Analytical Approximations of  the Term Structure for Jumpdiffusion Processes A Numerical Analysis 1996  Markov Chain Term Structure Models Extensions and Applications Exact Solutions for Bond and Options Prices with Systematic Jump Risk 1996 with Silverio Foresi Pricing Credit Sensitive Debt when Interest Rates Credit Ratings and Credit Spreads are Stochastic 1996  v52 161198 Portfolios for Investors Who Want to Reach Their Goals While Staying on the MeanVariance Efficient Frontier 2011  with Harry Markowitz Jonathan Scheid and Meir Statman  News Analytics Framework Techniques and Metrics The Handbook of News Analytics in Finance May 2011 John Wiley  Sons UK  Random Lattices for Option Pricing Problems in Finance 2011 
Implementing Option Pricing Models using Python and Cython 2010 The Finance Web Internet Information and Markets 2010  Financial Applications with Parallel R 2009  Recovery Swaps 2009 with Paul Hanouna   Recovery Rates 2009with Paul Hanouna  A Simple Model for Pricing Securities with a DebtEquity Linkage 2008 in  Credit Default Swap Spreads 2006 with Paul Hanouna  MultipleCore Processors for Finance Applications 2006  Power Laws 2005 with Jacob Sisk  Genetic Algorithms 2005 Recovery Risk 2005 Venture Capital Syndication with Hoje Jo and Yongtae Kim 2004 Technical Analysis with David Tien 2004 Liquidity and the Bond Markets with Jan Ericsson and  Madhu Kalimipalli 2003 Modern Pricing of Interest Rate Derivatives  Book Review  Contagion 2003 Hedge Funds 2003 Reprinted in  Working Papers on Hedge Funds in The World of Hedge Funds  Characteristics and  Analysis 2005 World Scientific The Internet and Investors 2003   Useful things to know about Correlated Default Risk with Gifford Fong Laurence Freed Gary Geng and Nikunj Kapadia The Regulation of Fee Structures in Mutual Funds A Theoretical Analysis  with Rangarajan Sundaram 1998 NBER WP No 6639 in the Courant Institute of Mathematical Sciences special volume on A DiscreteTime Approach to ArbitrageFree Pricing of Credit Derivatives  with Rangarajan Sundaram reprinted in  the Courant Institute of Mathematical Sciences special volume on Stochastic Mean Models of the Term Structure with Pierluigi Balduzzi Silverio Foresi and Rangarajan Sundaram  John Wiley  Sons Inc 128161 Interest Rate Modeling with JumpDiffusion Processes  John Wiley  Sons Inc 162189 Comments on Pricing ExcessofLoss Reinsurance Contracts against Catastrophic Loss by J David Cummins C Lewis and Richard Phillips Froot Ed University of Chicago Press 1999 141145   Pricing Credit Derivatives  J Frost and JG Whittaker 101138 On the Recursive Implementation of Term Structure Models  Efficient Rebalancing of Taxable Portfolios with Dan Ostrov Dennis Ding Vincent 
Newell  Rollover Risk and Capital Structure Covenants in Structured Finance Vehicles  with Seoyoung Kim  Liability Directed Investing in a Behavioral Portfolio Theory Framework  with Seoyoung Kim and Meir Statman  The Fast and the Curious VC Drift  with Amit Bubna and Paul Hanouna  Venture Capital Communities with Amit Bubna and Nagpurnanand Prabhala                                                  "

Mood Scoring Text

Now we will take a different page of text and mood score it.

#EXAMPLE OF MOOD SCORING
library(stringr)
url = "http://srdas.github.io/bio-candid.html"
text = read_web_page(url,cstem=0,cstop=0,ccase=0,cpunc=1,cflat=1)
print(text)
## [1] "Sanjiv Das is the William and Janice Terry Professor of Finance at Santa Clara Universitys Leavey School of Business He previously held faculty appointments as Associate Professor at Harvard Business School and UC Berkeley He holds postgraduate degrees in Finance MPhil and PhD from New York University Computer Science MS from UC Berkeley an MBA from the Indian Institute of Management Ahmedabad BCom in Accounting and Economics University of Bombay Sydenham College and is also a qualified Cost and Works Accountant He is a senior editor of The Journal of Investment Management coeditor of The Journal of Derivatives and The Journal of Financial Services Research and Associate Editor of other academic journals Prior to being an academic he worked in the derivatives business in the AsiaPacific region as a VicePresident at Citibank His current research interests include the modeling of default risk machine learning social networks derivatives pricing models portfolio theory and venture capital He has published over eighty articles in academic journals and has won numerous awards for research and teaching His recent book Derivatives Principles and Practice was published in May 2010  He currently also serves as a Senior Fellow at the FDIC Center for Financial Research After loafing and working in many parts of Asia but never really growing up Sanjiv moved to New York to change the world hopefully through research  He graduated in 1994 with a PhD from NYU and since then spent five years in Boston and now lives in San Jose California  Sanjiv loves animals places in the world where the mountains meet the sea riding sport motorbikes reading gadgets science fiction movies and writing cool software code When there is time available from the excitement of daily life Sanjiv writes academic papers which helps him relax Always the contrarian Sanjiv thinks that New York City is the most calming place in the world after California of course Sanjiv is now a Professor of Finance 
at Santa Clara University He came to SCU from Harvard Business School and spent a year at UC Berkeley In his past life in the unreal world Sanjiv worked at Citibank NA in the AsiaPacific region He takes great pleasure in merging his many previous lives into his current existence which is incredibly confused and diverse Sanjivs research style is instilled with a distinct New York state of mind  it is chaotic diverse with minimal method to the madness He has published articles on derivatives termstructure models mutual funds the internet portfolio choice banking models credit risk and has unpublished articles in many other areas Some years ago he took time off to get another degree in computer science at Berkeley confirming that an unchecked hobby can quickly become an obsession There he learnt about the fascinating field of Randomized Algorithms skills he now applies earnestly to his editorial work and other pursuits many of which stem from being in the epicenter of Silicon Valley Coastal living did a lot to mold Sanjiv who needs to live near the ocean  The many walks in Greenwich village convinced him that there is no such thing as a representative investor yet added many unique features to his personal utility function He learnt that it is important to open the academic door to the ivory tower and let the world in Academia is a real challenge given that he has to reconcile many more opinions than ideas He has been known to have turned down many offers from Mad magazine to publish his academic work As he often explains you never really finish your education  you can check out any time you like but you can never leave Which is why he is doomed to a lifetime in Hotel California And he believes that if this is as bad as it gets life is really pretty good"
text = str_replace_all(text,"nbsp"," ")
text
## [1] "Sanjiv Das is the William and Janice Terry Professor of Finance at Santa Clara Universitys Leavey School of Business He previously held faculty appointments as Associate Professor at Harvard Business School and UC Berkeley He holds postgraduate degrees in Finance MPhil and PhD from New York University Computer Science MS from UC Berkeley an MBA from the Indian Institute of Management Ahmedabad BCom in Accounting and Economics University of Bombay Sydenham College and is also a qualified Cost and Works Accountant He is a senior editor of The Journal of Investment Management coeditor of The Journal of Derivatives and The Journal of Financial Services Research and Associate Editor of other academic journals Prior to being an academic he worked in the derivatives business in the AsiaPacific region as a VicePresident at Citibank His current research interests include the modeling of default risk machine learning social networks derivatives pricing models portfolio theory and venture capital He has published over eighty articles in academic journals and has won numerous awards for research and teaching His recent book Derivatives Principles and Practice was published in May 2010  He currently also serves as a Senior Fellow at the FDIC Center for Financial Research After loafing and working in many parts of Asia but never really growing up Sanjiv moved to New York to change the world hopefully through research  He graduated in 1994 with a PhD from NYU and since then spent five years in Boston and now lives in San Jose California  Sanjiv loves animals places in the world where the mountains meet the sea riding sport motorbikes reading gadgets science fiction movies and writing cool software code When there is time available from the excitement of daily life Sanjiv writes academic papers which helps him relax Always the contrarian Sanjiv thinks that New York City is the most calming place in the world after California of course Sanjiv is now a Professor of Finance 
at Santa Clara University He came to SCU from Harvard Business School and spent a year at UC Berkeley In his past life in the unreal world Sanjiv worked at Citibank NA in the AsiaPacific region He takes great pleasure in merging his many previous lives into his current existence which is incredibly confused and diverse Sanjivs research style is instilled with a distinct New York state of mind  it is chaotic diverse with minimal method to the madness He has published articles on derivatives termstructure models mutual funds the internet portfolio choice banking models credit risk and has unpublished articles in many other areas Some years ago he took time off to get another degree in computer science at Berkeley confirming that an unchecked hobby can quickly become an obsession There he learnt about the fascinating field of Randomized Algorithms skills he now applies earnestly to his editorial work and other pursuits many of which stem from being in the epicenter of Silicon Valley Coastal living did a lot to mold Sanjiv who needs to live near the ocean  The many walks in Greenwich village convinced him that there is no such thing as a representative investor yet added many unique features to his personal utility function He learnt that it is important to open the academic door to the ivory tower and let the world in Academia is a real challenge given that he has to reconcile many more opinions than ideas He has been known to have turned down many offers from Mad magazine to publish his academic work As he often explains you never really finish your education  you can check out any time you like but you can never leave Which is why he is doomed to a lifetime in Hotel California And he believes that if this is as bad as it gets life is really pretty good"
text = unlist(strsplit(text," "))
print(text)
##   [1] "Sanjiv"         "Das"            "is"             "the"           
##   [5] "William"        "and"            "Janice"         "Terry"         
##   [9] "Professor"      "of"             "Finance"        "at"            
##  [13] "Santa"          "Clara"          "Universitys"    "Leavey"        
##  [17] "School"         "of"             "Business"       "He"            
##  [21] "previously"     "held"           "faculty"        "appointments"  
##  [25] "as"             "Associate"      "Professor"      "at"            
##  [29] "Harvard"        "Business"       "School"         "and"           
##  [33] "UC"             "Berkeley"       "He"             "holds"         
##  [37] "postgraduate"   "degrees"        "in"             "Finance"       
##  [41] "MPhil"          "and"            "PhD"            "from"          
##  [45] "New"            "York"           "University"     "Computer"      
##  [49] "Science"        "MS"             "from"           "UC"            
##  [53] "Berkeley"       "an"             "MBA"            "from"          
##  [57] "the"            "Indian"         "Institute"      "of"            
##  [61] "Management"     "Ahmedabad"      "BCom"           "in"            
##  [65] "Accounting"     "and"            "Economics"      "University"    
##  [69] "of"             "Bombay"         "Sydenham"       "College"       
##  [73] "and"            "is"             "also"           "a"             
##  [77] "qualified"      "Cost"           "and"            "Works"         
##  [81] "Accountant"     "He"             "is"             "a"             
##  [85] "senior"         "editor"         "of"             "The"           
##  [89] "Journal"        "of"             "Investment"     "Management"    
##  [93] "coeditor"       "of"             "The"            "Journal"       
##  [97] "of"             "Derivatives"    "and"            "The"           
## [101] "Journal"        "of"             "Financial"      "Services"      
## [105] "Research"       "and"            "Associate"      "Editor"        
## [109] "of"             "other"          "academic"       "journals"      
## [113] "Prior"          "to"             "being"          "an"            
## [117] "academic"       "he"             "worked"         "in"            
## [121] "the"            "derivatives"    "business"       "in"            
## [125] "the"            "AsiaPacific"    "region"         "as"            
## [129] "a"              "VicePresident"  "at"             "Citibank"      
## [133] "His"            "current"        "research"       "interests"     
## [137] "include"        "the"            "modeling"       "of"            
## [141] "default"        "risk"           "machine"        "learning"      
## [145] "social"         "networks"       "derivatives"    "pricing"       
## [149] "models"         "portfolio"      "theory"         "and"           
## [153] "venture"        "capital"        "He"             "has"           
## [157] "published"      "over"           "eighty"         "articles"      
## [161] "in"             "academic"       "journals"       "and"           
## [165] "has"            "won"            "numerous"       "awards"        
## [169] "for"            "research"       "and"            "teaching"      
## [173] "His"            "recent"         "book"           "Derivatives"   
## [177] "Principles"     "and"            "Practice"       "was"           
## [181] "published"      "in"             "May"            "2010"          
## [185] ""               "He"             "currently"      "also"          
## [189] "serves"         "as"             "a"              "Senior"        
## [193] "Fellow"         "at"             "the"            "FDIC"          
## [197] "Center"         "for"            "Financial"      "Research"      
## [201] "After"          "loafing"        "and"            "working"       
## [205] "in"             "many"           "parts"          "of"            
## [209] "Asia"           "but"            "never"          "really"        
## [213] "growing"        "up"             "Sanjiv"         "moved"         
## [217] "to"             "New"            "York"           "to"            
## [221] "change"         "the"            "world"          "hopefully"     
## [225] "through"        "research"       ""               "He"            
## [229] "graduated"      "in"             "1994"           "with"          
## [233] "a"              "PhD"            "from"           "NYU"           
## [237] "and"            "since"          "then"           "spent"         
## [241] "five"           "years"          "in"             "Boston"        
## [245] "and"            "now"            "lives"          "in"            
## [249] "San"            "Jose"           "California"     ""              
## [253] "Sanjiv"         "loves"          "animals"        "places"        
## [257] "in"             "the"            "world"          "where"         
## [261] "the"            "mountains"      "meet"           "the"           
## [265] "sea"            "riding"         "sport"          "motorbikes"    
## [269] "reading"        "gadgets"        "science"        "fiction"       
## [273] "movies"         "and"            "writing"        "cool"          
## [277] "software"       "code"           "When"           "there"         
## [281] "is"             "time"           "available"      "from"          
## [285] "the"            "excitement"     "of"             "daily"         
## [289] "life"           "Sanjiv"         "writes"         "academic"      
## [293] "papers"         "which"          "helps"          "him"           
## [297] "relax"          "Always"         "the"            "contrarian"    
## [301] "Sanjiv"         "thinks"         "that"           "New"           
## [305] "York"           "City"           "is"             "the"           
## [309] "most"           "calming"        "place"          "in"            
## [313] "the"            "world"          "after"          "California"    
## [317] "of"             "course"         "Sanjiv"         "is"            
## [321] "now"            "a"              "Professor"      "of"            
## [325] "Finance"        "at"             "Santa"          "Clara"         
## [329] "University"     "He"             "came"           "to"            
## [333] "SCU"            "from"           "Harvard"        "Business"      
## [337] "School"         "and"            "spent"          "a"             
## [341] "year"           "at"             "UC"             "Berkeley"      
## [345] "In"             "his"            "past"           "life"          
## [349] "in"             "the"            "unreal"         "world"         
## [353] "Sanjiv"         "worked"         "at"             "Citibank"      
## [357] "NA"             "in"             "the"            "AsiaPacific"   
## [361] "region"         "He"             "takes"          "great"         
## [365] "pleasure"       "in"             "merging"        "his"           
## [369] "many"           "previous"       "lives"          "into"          
## [373] "his"            "current"        "existence"      "which"         
## [377] "is"             "incredibly"     "confused"       "and"           
## [381] "diverse"        "Sanjivs"        "research"       "style"         
## [385] "is"             "instilled"      "with"           "a"             
## [389] "distinct"       "New"            "York"           "state"         
## [393] "of"             "mind"           ""               "it"            
## [397] "is"             "chaotic"        "diverse"        "with"          
## [401] "minimal"        "method"         "to"             "the"           
## [405] "madness"        "He"             "has"            "published"     
## [409] "articles"       "on"             "derivatives"    "termstructure" 
## [413] "models"         "mutual"         "funds"          "the"           
## [417] "internet"       "portfolio"      "choice"         "banking"       
## [421] "models"         "credit"         "risk"           "and"           
## [425] "has"            "unpublished"    "articles"       "in"            
## [429] "many"           "other"          "areas"          "Some"          
## [433] "years"          "ago"            "he"             "took"          
## [437] "time"           "off"            "to"             "get"           
## [441] "another"        "degree"         "in"             "computer"      
## [445] "science"        "at"             "Berkeley"       "confirming"    
## [449] "that"           "an"             "unchecked"      "hobby"         
## [453] "can"            "quickly"        "become"         "an"            
## [457] "obsession"      "There"          "he"             "learnt"        
## [461] "about"          "the"            "fascinating"    "field"         
## [465] "of"             "Randomized"     "Algorithms"     "skills"        
## [469] "he"             "now"            "applies"        "earnestly"     
## [473] "to"             "his"            "editorial"      "work"          
## [477] "and"            "other"          "pursuits"       "many"          
## [481] "of"             "which"          "stem"           "from"          
## [485] "being"          "in"             "the"            "epicenter"     
## [489] "of"             "Silicon"        "Valley"         "Coastal"       
## [493] "living"         "did"            "a"              "lot"           
## [497] "to"             "mold"           "Sanjiv"         "who"           
## [501] "needs"          "to"             "live"           "near"          
## [505] "the"            "ocean"          ""               "The"           
## [509] "many"           "walks"          "in"             "Greenwich"     
## [513] "village"        "convinced"      "him"            "that"          
## [517] "there"          "is"             "no"             "such"          
## [521] "thing"          "as"             "a"              "representative"
## [525] "investor"       "yet"            "added"          "many"          
## [529] "unique"         "features"       "to"             "his"           
## [533] "personal"       "utility"        "function"       "He"            
## [537] "learnt"         "that"           "it"             "is"            
## [541] "important"      "to"             "open"           "the"           
## [545] "academic"       "door"           "to"             "the"           
## [549] "ivory"          "tower"          "and"            "let"           
## [553] "the"            "world"          "in"             "Academia"      
## [557] "is"             "a"              "real"           "challenge"     
## [561] "given"          "that"           "he"             "has"           
## [565] "to"             "reconcile"      "many"           "more"          
## [569] "opinions"       "than"           "ideas"          "He"            
## [573] "has"            "been"           "known"          "to"            
## [577] "have"           "turned"         "down"           "many"          
## [581] "offers"         "from"           "Mad"            "magazine"      
## [585] "to"             "publish"        "his"            "academic"      
## [589] "work"           "As"             "he"             "often"         
## [593] "explains"       "you"            "never"          "really"        
## [597] "finish"         "your"           "education"      ""              
## [601] "you"            "can"            "check"          "out"           
## [605] "any"            "time"           "you"            "like"          
## [609] "but"            "you"            "can"            "never"         
## [613] "leave"          "Which"          "is"             "why"           
## [617] "he"             "is"             "doomed"         "to"            
## [621] "a"              "lifetime"       "in"             "Hotel"         
## [625] "California"     "And"            "he"             "believes"      
## [629] "that"           "if"             "this"           "is"            
## [633] "as"             "bad"            "as"             "it"            
## [637] "gets"           "life"           "is"             "really"        
## [641] "pretty"         "good"
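#match() RETURNS THE LEXICON INDEX OF EACH WORD IN text, OR NA IF IT IS NOT IN THE LEXICON
posmatch = match(text,poswords)
#COUNT THE NON-NA ENTRIES, I.E., THE WORDS FOUND IN THE POSITIVE LEXICON
numposmatch = length(posmatch[which(posmatch>0)])
#SAME FOR THE NEGATIVE LEXICON
negmatch = match(text,negwords)
numnegmatch = length(negmatch[which(negmatch>0)])
print(c(numposmatch,numnegmatch))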
## [1] 26 16
#FURTHER EXPLORATION OF THESE OBJECTS
print(length(text))
## [1] 642
print(posmatch)
##   [1]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
##  [15]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
##  [29]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
##  [43]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
##  [57]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
##  [71]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
##  [85]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
##  [99]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [113]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [127]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [141]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [155]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [169]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [183]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [197]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [211]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [225]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [239]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [253]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA  994   NA   NA   NA
## [267]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [281]   NA   NA   NA   NA   NA  611   NA   NA   NA   NA   NA   NA   NA   NA
## [295]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [309]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [323]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [337]   NA   NA   NA   NA   NA   NA   NA   NA   NA  800   NA   NA   NA   NA
## [351]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA  761
## [365] 1144   NA   NA  800   NA   NA   NA   NA  800   NA   NA   NA   NA   NA
## [379]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA  515   NA   NA   NA
## [393]   NA 1011   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [407]   NA   NA   NA   NA   NA   NA   NA 1036   NA   NA   NA   NA   NA   NA
## [421]   NA  455   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [435]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [449]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [463]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA  800   NA   NA
## [477]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [491]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA  941   NA
## [505]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [519]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA 1571   NA   NA  800
## [533]   NA   NA   NA   NA   NA   NA   NA   NA  838   NA 1076   NA   NA   NA
## [547]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA 1255   NA
## [561]   NA   NA   NA   NA   NA 1266   NA   NA   NA   NA   NA   NA   NA   NA
## [575]   NA   NA  781   NA   NA   NA   NA   NA   NA   NA   NA   NA  800   NA
## [589]   NA   NA   NA   NA   NA   NA   NA   NA   NA 1645  542   NA   NA   NA
## [603]   NA   NA   NA   NA   NA  940   NA   NA   NA   NA   NA   NA   NA   NA
## [617]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [631]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA 1184  747
print(text[77])
## [1] "qualified"
print(poswords[204])
## [1] "back"
is.na(posmatch)
##   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [23]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [34]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [45]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [56]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [67]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [78]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [89]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [100]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [111]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [122]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [133]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [144]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [155]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [166]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [177]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [188]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [199]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [210]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [221]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [232]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [243]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [254]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [265]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [276]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [287]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [298]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [309]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [320]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [331]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [342]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [353]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [364] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [375]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [386]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [397]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [408]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [419]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [430]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [441]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [452]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [463]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [474] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [485]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [496]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## [507]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [518]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [529] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [540]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [551]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [562]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [573]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [584]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [595]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [606]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [617]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [628]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [639]  TRUE  TRUE FALSE FALSE
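
The NA pattern in the output above suggests a more compact way to count and list matches. The following is a minimal sketch on toy data; the vectors toy_text and toy_poswords are hypothetical stand-ins for text and poswords and are not part of the original example.

```r
#TOY STAND-INS FOR text AND poswords (HYPOTHETICAL DATA)
toy_text = c("life", "is", "really", "pretty", "good", "pretty")
toy_poswords = c("good", "great", "pretty")

#match() GIVES THE LEXICON INDEX OF EACH WORD, OR NA IF THE WORD IS ABSENT
toy_match = match(toy_text, toy_poswords)
print(toy_match)

#COUNT THE MATCHES DIRECTLY FROM THE NON-NA ENTRIES
num_match = sum(!is.na(toy_match))
print(num_match)

#RECOVER WHICH WORDS IN THE TEXT WERE MATCHED
matched_words = toy_text[!is.na(toy_match)]
print(matched_words)
```

Here sum(!is.na(...)) gives the same count as the length(...[which(...>0)]) idiom, and indexing the text by the non-NA positions shows which words drove the score.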

Language Detection

We may be scraping web sites from many countries and need to detect the language and then translate it into English for mood scoring. The useful package textcat enables us to categorize the language.

library(textcat)
text = c("Je suis un programmeur novice.",
         "I am a programmer who is a novice.",
         "Sono un programmatore alle prime armi.",
         "Ich bin ein Anfänger Programmierer",
         "Soy un programador con errores.")

lang = textcat(text)
print(lang)
## [1] "french"  "english" "italian" "german"  "spanish"

Language Translation

And of course, once the language is detected, we may translate it into English.

library(translate)
set.key("YOUR_GOOGLE_API_KEY") #REPLACE WITH YOUR OWN KEY; DO NOT PUBLISH REAL KEYS
print(translate(text[1],"fr","en"))
## [[1]]
## [1] "I am a novice programmer."
print(translate(text[3],"it","en"))
## [[1]]
## [1] "I&#39;m a novice programmer."
print(translate(text[4],"de","en"))
## [[1]]
## [1] "I am a beginner programmer"
print(translate(text[5],"es","en"))
## [[1]]
## [1] "I&#39;m a programmer errors."

This requires a Google API key, for which you need to set up a paid account.
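
Some of the translated strings above come back with HTML entities such as &#39; in place of an apostrophe, which would interfere with word matching against a lexicon. A minimal sketch of a cleanup step in base R follows; the helper decode_entities() is hypothetical, is not part of the translate package, and handles only a handful of common entities.

```r
#TRANSLATED STRING AS RETURNED ABOVE, WITH AN HTML ENTITY FOR THE APOSTROPHE
translated = "I&#39;m a novice programmer."

#HYPOTHETICAL HELPER: DECODE A FEW COMMON HTML ENTITIES WITH BASE R
decode_entities = function(s) {
  s = gsub("&#39;", "'", s, fixed=TRUE)
  s = gsub("&quot;", "\"", s, fixed=TRUE)
  s = gsub("&lt;", "<", s, fixed=TRUE)
  s = gsub("&gt;", ">", s, fixed=TRUE)
  s = gsub("&amp;", "&", s, fixed=TRUE)   #DECODE & LAST TO AVOID DOUBLE DECODING
  s
}
print(decode_entities(translated))
```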

Text Classification

  1. Machine classification is, from a layman’s point of view, nothing but learning by example. In new-fangled modern parlance, it is a technique in the field of “machine learning”.

  2. Learning by machines falls into two categories, supervised and unsupervised. When a number of explanatory \(X\) variables are used to determine some outcome \(Y\), and we train an algorithm to do this, we are performing supervised (machine) learning. The outcome \(Y\) may be a dependent variable (for example, the left hand side in a linear regression), or a classification (i.e., discrete outcome).

  3. When we only have \(X\) variables and no separate outcome variable \(Y\), we perform unsupervised learning. For example, cluster analysis produces groupings based on the \(X\) variables of various entities, and is a common example.
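
The distinction between the two categories can be sketched on the built-in Iris data. This is an illustrative sketch only: the choice of kmeans() for the unsupervised case and lm() for the supervised case is an assumption made for the example, not a method prescribed here.

```r
data(iris)
X = iris[,1:4]
Y = iris[,5]

#UNSUPERVISED: CLUSTER ON X ALONE, THE LABELS Y ARE NOT USED IN FITTING
set.seed(42)
km = kmeans(X, centers=3, nstart=20)
#Y IS USED ONLY AFTER THE FACT, TO SEE HOW CLUSTERS LINE UP WITH SPECIES
print(table(km$cluster, Y))

#SUPERVISED: TRAIN ON X TO EXPLAIN A KNOWN OUTCOME, HERE A LINEAR REGRESSION
fit = lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length, data=iris)
print(coef(fit))
```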

Classification Algorithms

We start with a simple example on numerical data before discussing how this is to be applied to text. We first look at the Bayes classifier.

Bayes Classifier - 1

Bayes classification extends the Document-Term model with a document-term-classification model. These are the three entities in the model and we denote them as \((d,t,c)\). Assume that there are \(D\) documents to classify into \(C\) categories, and we employ a dictionary/lexicon (as the case may be) of \(T\) terms or words. Hence we have \(d_i, i = 1, ... , D\), and \(t_j, j = 1, ... , T\). And correspondingly the categories for classification are \(c_k, k = 1, ... , C\).

Bayes Classifier - 2

Suppose we are given a text corpus of stock-market-related documents (tweets for example), and wish to classify them into bullish (\(c_1\)), neutral (\(c_2\)), or bearish (\(c_3\)), where \(C=3\). We first need to train the Bayes classifier using a training data set, with pre-classified documents, numbering \(D\). For each term \(t\) in the lexicon, we can compute how likely it is to appear in documents in each class \(c_k\). Therefore, for each class, there is a \(T\)-sided die with each face representing a term and having a probability of coming up. These dice give the conditional probabilities (likelihoods) of seeing a word in each class of document. We denote these probabilities succinctly as \(p(t | c)\). For example, in a bearish document, if the word “sell” comprises 10% of the words that appear, then \(p(t=\mbox{sell} | c=\mbox{bearish})=0.10\).

Bayes Classifier - 3

To ensure that a word has a non-zero probability even when it never appears in a class, we compute the probabilities with add-one (Laplace) smoothing as follows:

\[ \begin{equation} p(t | c) = \frac{n(t | c) + 1}{n(c)+T} \end{equation} \]

where \(n(t | c)\) is the number of times word \(t\) appears in category \(c\), and \(n(c) = \sum_t n(t | c)\) is the total number of words in the training data in class \(c\). Note that if there are no words in the class \(c\), then each term \(t\) has probability \(1/T\).
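
The smoothing formula can be sketched in a few lines of R. The term counts below are hypothetical, chosen only to illustrate the arithmetic, and the names counts, nT, n_c and p_tc are introduced for this example.

```r
#HYPOTHETICAL TERM COUNTS n(t|c): ROWS ARE TERMS, COLUMNS ARE CLASSES
counts = matrix(c(8, 1, 0,
                  1, 5, 1,
                  0, 2, 9),
                nrow=3, byrow=TRUE,
                dimnames=list(c("buy","hold","sell"),
                              c("bullish","neutral","bearish")))
nT = nrow(counts)              #NUMBER OF TERMS T IN THE LEXICON
n_c = colSums(counts)          #TOTAL WORDS n(c) IN EACH CLASS

#SMOOTHED PROBABILITIES p(t|c) = (n(t|c)+1)/(n(c)+T), COLUMN BY COLUMN
p_tc = sweep(counts + 1, 2, n_c + nT, "/")
print(round(p_tc, 3))

#EACH COLUMN IS A PROBABILITY DISTRIBUTION OVER TERMS
print(colSums(p_tc))
```

Note that "sell" never appears in the bullish class in this toy data, yet it receives probability 1/12 rather than zero, and each column still sums to one.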

Bayes Classifier - 4

A document \(d_i\) is a collection or set of words \(t_j\). The probability of seeing a given document in each category is given by the following multinomial probability:

\[ \begin{equation} p(d | c) = \frac{n(d)!}{n(t_1|d)! \cdot n(t_2|d)! \cdots n(t_T|d)!} \times p(t_1 | c)^{n(t_1|d)} \cdot p(t_2 | c)^{n(t_2|d)} \cdots p(t_T | c)^{n(t_T|d)} \nonumber \end{equation} \]

where \(n(d)\) is the number of words in the document, and \(n(t_j | d)\) is the number of occurrences of word \(t_j\) in the same document \(d\). These \(p(d | c)\) are the likelihoods in the Bayes classifier, built from the term probabilities estimated on all documents in the training data; the priors are the class probabilities \(p(c)\). The posterior probabilities are computed for each document in the test data as follows:

\[ \begin{equation} p(c_k | d) = \frac{p(d | c_k) p(c_k)}{\sum_{j=1}^{C} \; p(d | c_j) p(c_j)}, \quad \forall k = 1, \ldots, C \nonumber \end{equation} \]

Note that we get \(C\) posterior probabilities for document \(d\), and assign the document to class \(\arg\max_{c_k} p(c_k | d)\), i.e., the class with the highest posterior probability for the given document.
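
The classification step can be sketched numerically. All numbers below are hypothetical; the sketch uses dmultinom() from base R for the multinomial probability of the document's term counts, and the names p_tc, p_c, n_td, p_dc and post are introduced for this example.

```r
#HYPOTHETICAL TERM PROBABILITIES p(t|c): ROWS ARE TERMS, COLUMNS ARE CLASSES
p_tc = matrix(c(0.75, 0.2, 0.1,
                0.15, 0.6, 0.2,
                0.10, 0.2, 0.7),
              nrow=3, byrow=TRUE,
              dimnames=list(c("buy","hold","sell"),
                            c("bullish","neutral","bearish")))
p_c = c(bullish=0.4, neutral=0.3, bearish=0.3)   #HYPOTHETICAL CLASS PROBABILITIES p(c)
n_td = c(buy=0, hold=1, sell=3)                  #TERM COUNTS n(t|d) IN ONE DOCUMENT

#MULTINOMIAL PROBABILITY p(d|c) OF THE DOCUMENT UNDER EACH CLASS
p_dc = apply(p_tc, 2, function(p) dmultinom(n_td, prob=p))

#POSTERIOR p(c|d) BY BAYES RULE, THEN PICK THE CLASS WITH THE HIGHEST POSTERIOR
post = p_dc * p_c / sum(p_dc * p_c)
print(round(post, 4))
print(names(which.max(post)))
```

With three occurrences of "sell" and none of "buy", the document is assigned to the bearish class. For long documents the products become very small, so in practice one would usually work with log probabilities instead.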

Naive Bayes in R

We use the e1071 package. It has a one-line command that takes in the tagged training dataset using the function naiveBayes(). It returns the trained classifier model.

The trained classifier contains the unconditional probabilities \(p(c)\) of each class, which are merely the frequencies with which each class appears in the training data. It also shows the conditional probability distributions \(p(t |c)\), given as the mean and standard deviation of the occurrence of these terms in each class. We may take this trained model and re-apply it to the training data set to see how well it does. We use the predict() function for this. The data set here is the classic Iris data.

For text mining, the feature set in the data will be the set of all words, and there will be one column for each word. Hence, this will be a large feature set. To keep it small, we may instead use only a lexicon’s words as the set of features. This will vastly reduce the feature set used in the classifier and make it more specific.
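
A lexicon-restricted feature set can be sketched in base R as follows. The mini corpus and lexicon are hypothetical, and the tokenization (lowercasing and splitting on whitespace) is a simplifying assumption; the resulting matrix has one row per document and one column per lexicon word, the shape a classifier such as naiveBayes() would take as its feature input.

```r
#HYPOTHETICAL MINI CORPUS AND LEXICON
docs = c("buy buy this stock now",
         "sell now before the crash",
         "hold the stock and buy later")
lexicon = c("buy","hold","sell")

#TOKENIZE EACH DOCUMENT ON WHITESPACE
tokens = strsplit(tolower(docs), "\\s+")

#COUNT ONLY LEXICON WORDS; WORDS OUTSIDE THE LEXICON ARE DROPPED
dtm = t(sapply(tokens, function(w) as.vector(table(factor(w, levels=lexicon)))))
dimnames(dtm) = list(paste0("doc", seq_along(docs)), lexicon)
print(dtm)
```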

Example

library(e1071)
data(iris)
print(head(iris))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
tail(iris)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica
#NAIVE BAYES
res = naiveBayes(iris[,1:4],iris[,5])
#SHOWS THE PRIOR AND LIKELIHOOD FUNCTIONS
res
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = iris[, 1:4], y = iris[, 5])
## 
## A-priori probabilities:
## iris[, 5]
##     setosa versicolor  virginica 
##  0.3333333  0.3333333  0.3333333 
## 
## Conditional probabilities:
##             Sepal.Length
## iris[, 5]     [,1]      [,2]
##   setosa     5.006 0.3524897
##   versicolor 5.936 0.5161711
##   virginica  6.588 0.6358796
## 
##             Sepal.Width
## iris[, 5]     [,1]      [,2]
##   setosa     3.428 0.3790644
##   versicolor 2.770 0.3137983
##   virginica  2.974 0.3224966
## 
##             Petal.Length
## iris[, 5]     [,1]      [,2]
##   setosa     1.462 0.1736640
##   versicolor 4.260 0.4699110
##   virginica  5.552 0.5518947
## 
##             Petal.Width
## iris[, 5]     [,1]      [,2]
##   setosa     0.246 0.1053856
##   versicolor 1.326 0.1977527
##   virginica  2.026 0.2746501
#SHOWS POSTERIOR PROBABILITIES
predict(res,iris[,1:4],type="raw")
##               setosa   versicolor    virginica
##   [1,]  1.000000e+00 2.981309e-18 2.152373e-25
##   [2,]  1.000000e+00 3.169312e-17 6.938030e-25
##   [3,]  1.000000e+00 2.367113e-18 7.240956e-26
##   [4,]  1.000000e+00 3.069606e-17 8.690636e-25
##   [5,]  1.000000e+00 1.017337e-18 8.885794e-26
##   [6,]  1.000000e+00 2.717732e-14 4.344285e-21
##   [7,]  1.000000e+00 2.321639e-17 7.988271e-25
##   [8,]  1.000000e+00 1.390751e-17 8.166995e-25
##   [9,]  1.000000e+00 1.990156e-17 3.606469e-25
##  [10,]  1.000000e+00 7.378931e-18 3.615492e-25
##  [11,]  1.000000e+00 9.396089e-18 1.474623e-24
##  [12,]  1.000000e+00 3.461964e-17 2.093627e-24
##  [13,]  1.000000e+00 2.804520e-18 1.010192e-25
##  [14,]  1.000000e+00 1.799033e-19 6.060578e-27
##  [15,]  1.000000e+00 5.533879e-19 2.485033e-25
##  [16,]  1.000000e+00 6.273863e-17 4.509864e-23
##  [17,]  1.000000e+00 1.106658e-16 1.282419e-23
##  [18,]  1.000000e+00 4.841773e-17 2.350011e-24
##  [19,]  1.000000e+00 1.126175e-14 2.567180e-21
##  [20,]  1.000000e+00 1.808513e-17 1.963924e-24
##  [21,]  1.000000e+00 2.178382e-15 2.013989e-22
##  [22,]  1.000000e+00 1.210057e-15 7.788592e-23
##  [23,]  1.000000e+00 4.535220e-20 3.130074e-27
##  [24,]  1.000000e+00 3.147327e-11 8.175305e-19
##  [25,]  1.000000e+00 1.838507e-14 1.553757e-21
##  [26,]  1.000000e+00 6.873990e-16 1.830374e-23
##  [27,]  1.000000e+00 3.192598e-14 1.045146e-21
##  [28,]  1.000000e+00 1.542562e-17 1.274394e-24
##  [29,]  1.000000e+00 8.833285e-18 5.368077e-25
##  [30,]  1.000000e+00 9.557935e-17 3.652571e-24
##  [31,]  1.000000e+00 2.166837e-16 6.730536e-24
##  [32,]  1.000000e+00 3.940500e-14 1.546678e-21
##  [33,]  1.000000e+00 1.609092e-20 1.013278e-26
##  [34,]  1.000000e+00 7.222217e-20 4.261853e-26
##  [35,]  1.000000e+00 6.289348e-17 1.831694e-24
##  [36,]  1.000000e+00 2.850926e-18 8.874002e-26
##  [37,]  1.000000e+00 7.746279e-18 7.235628e-25
##  [38,]  1.000000e+00 8.623934e-20 1.223633e-26
##  [39,]  1.000000e+00 4.612936e-18 9.655450e-26
##  [40,]  1.000000e+00 2.009325e-17 1.237755e-24
##  [41,]  1.000000e+00 1.300634e-17 5.657689e-25
##  [42,]  1.000000e+00 1.577617e-15 5.717219e-24
##  [43,]  1.000000e+00 1.494911e-18 4.800333e-26
##  [44,]  1.000000e+00 1.076475e-10 3.721344e-18
##  [45,]  1.000000e+00 1.357569e-12 1.708326e-19
##  [46,]  1.000000e+00 3.882113e-16 5.587814e-24
##  [47,]  1.000000e+00 5.086735e-18 8.960156e-25
##  [48,]  1.000000e+00 5.012793e-18 1.636566e-25
##  [49,]  1.000000e+00 5.717245e-18 8.231337e-25
##  [50,]  1.000000e+00 7.713456e-18 3.349997e-25
##  [51,] 4.893048e-107 8.018653e-01 1.981347e-01
##  [52,] 7.920550e-100 9.429283e-01 5.707168e-02
##  [53,] 5.494369e-121 4.606254e-01 5.393746e-01
##  [54,]  1.129435e-69 9.999621e-01 3.789964e-05
##  [55,] 1.473329e-105 9.503408e-01 4.965916e-02
##  [56,]  1.931184e-89 9.990013e-01 9.986538e-04
##  [57,] 4.539099e-113 6.592515e-01 3.407485e-01
##  [58,]  2.549753e-34 9.999997e-01 3.119517e-07
##  [59,]  6.562814e-97 9.895385e-01 1.046153e-02
##  [60,]  5.000210e-69 9.998928e-01 1.071638e-04
##  [61,]  7.354548e-41 9.999997e-01 3.143915e-07
##  [62,]  4.799134e-86 9.958564e-01 4.143617e-03
##  [63,]  4.631287e-60 9.999925e-01 7.541274e-06
##  [64,] 1.052252e-103 9.850868e-01 1.491324e-02
##  [65,]  4.789799e-55 9.999700e-01 2.999393e-05
##  [66,]  1.514706e-92 9.787587e-01 2.124125e-02
##  [67,]  1.338348e-97 9.899311e-01 1.006893e-02
##  [68,]  2.026115e-62 9.999799e-01 2.007314e-05
##  [69,] 6.547473e-101 9.941996e-01 5.800427e-03
##  [70,]  3.016276e-58 9.999913e-01 8.739959e-06
##  [71,] 1.053341e-127 1.609361e-01 8.390639e-01
##  [72,]  1.248202e-70 9.997743e-01 2.256698e-04
##  [73,] 3.294753e-119 9.245812e-01 7.541876e-02
##  [74,]  1.314175e-95 9.979398e-01 2.060233e-03
##  [75,]  3.003117e-83 9.982736e-01 1.726437e-03
##  [76,]  2.536747e-92 9.865372e-01 1.346281e-02
##  [77,] 1.558909e-111 9.102260e-01 8.977398e-02
##  [78,] 7.014282e-136 7.989607e-02 9.201039e-01
##  [79,]  5.034528e-99 9.854957e-01 1.450433e-02
##  [80,]  1.439052e-41 9.999984e-01 1.601574e-06
##  [81,]  1.251567e-54 9.999955e-01 4.500139e-06
##  [82,]  8.769539e-48 9.999983e-01 1.742560e-06
##  [83,]  3.447181e-62 9.999664e-01 3.361987e-05
##  [84,] 1.087302e-132 6.134355e-01 3.865645e-01
##  [85,]  4.119852e-97 9.918297e-01 8.170260e-03
##  [86,] 1.140835e-102 8.734107e-01 1.265893e-01
##  [87,] 2.247339e-110 7.971795e-01 2.028205e-01
##  [88,]  4.870630e-88 9.992978e-01 7.022084e-04
##  [89,]  2.028672e-72 9.997620e-01 2.379898e-04
##  [90,]  2.227900e-69 9.999461e-01 5.390514e-05
##  [91,]  5.110709e-81 9.998510e-01 1.489819e-04
##  [92,]  5.774841e-99 9.885399e-01 1.146006e-02
##  [93,]  5.146736e-66 9.999591e-01 4.089540e-05
##  [94,]  1.332816e-34 9.999997e-01 2.716264e-07
##  [95,]  6.094144e-77 9.998034e-01 1.966331e-04
##  [96,]  1.424276e-72 9.998236e-01 1.764463e-04
##  [97,]  8.302641e-77 9.996692e-01 3.307548e-04
##  [98,]  1.835520e-82 9.988601e-01 1.139915e-03
##  [99,]  5.710350e-30 9.999997e-01 3.094739e-07
## [100,]  3.996459e-73 9.998204e-01 1.795726e-04
## [101,] 3.993755e-249 1.031032e-10 1.000000e+00
## [102,] 1.228659e-149 2.724406e-02 9.727559e-01
## [103,] 2.460661e-216 2.327488e-07 9.999998e-01
## [104,] 2.864831e-173 2.290954e-03 9.977090e-01
## [105,] 8.299884e-214 3.175384e-07 9.999997e-01
## [106,] 1.371182e-267 3.807455e-10 1.000000e+00
## [107,] 3.444090e-107 9.719885e-01 2.801154e-02
## [108,] 3.741929e-224 1.782047e-06 9.999982e-01
## [109,] 5.564644e-188 5.823191e-04 9.994177e-01
## [110,] 2.052443e-260 2.461662e-12 1.000000e+00
## [111,] 8.669405e-159 4.895235e-04 9.995105e-01
## [112,] 4.220200e-163 3.168643e-03 9.968314e-01
## [113,] 4.360059e-190 6.230821e-06 9.999938e-01
## [114,] 6.142256e-151 1.423414e-02 9.857659e-01
## [115,] 2.201426e-186 1.393247e-06 9.999986e-01
## [116,] 2.949945e-191 6.128385e-07 9.999994e-01
## [117,] 2.909076e-168 2.152843e-03 9.978472e-01
## [118,] 1.347608e-281 2.872996e-12 1.000000e+00
## [119,] 2.786402e-306 1.151469e-12 1.000000e+00
## [120,] 2.082510e-123 9.561626e-01 4.383739e-02
## [121,] 2.194169e-217 1.712166e-08 1.000000e+00
## [122,] 3.325791e-145 1.518718e-02 9.848128e-01
## [123,] 6.251357e-269 1.170872e-09 1.000000e+00
## [124,] 4.415135e-135 1.360432e-01 8.639568e-01
## [125,] 6.315716e-201 1.300512e-06 9.999987e-01
## [126,] 5.257347e-203 9.507989e-06 9.999905e-01
## [127,] 1.476391e-129 2.067703e-01 7.932297e-01
## [128,] 8.772841e-134 1.130589e-01 8.869411e-01
## [129,] 5.230800e-194 1.395719e-05 9.999860e-01
## [130,] 7.014892e-179 8.232518e-04 9.991767e-01
## [131,] 6.306820e-218 1.214497e-06 9.999988e-01
## [132,] 2.539020e-247 4.668891e-10 1.000000e+00
## [133,] 2.210812e-201 2.000316e-06 9.999980e-01
## [134,] 1.128613e-128 7.118948e-01 2.881052e-01
## [135,] 8.114869e-151 4.900992e-01 5.099008e-01
## [136,] 7.419068e-249 1.448050e-10 1.000000e+00
## [137,] 1.004503e-215 9.743357e-09 1.000000e+00
## [138,] 1.346716e-167 2.186989e-03 9.978130e-01
## [139,] 1.994716e-128 1.999894e-01 8.000106e-01
## [140,] 8.440466e-185 6.769126e-06 9.999932e-01
## [141,] 2.334365e-218 7.456220e-09 1.000000e+00
## [142,] 2.179139e-183 6.352663e-07 9.999994e-01
## [143,] 1.228659e-149 2.724406e-02 9.727559e-01
## [144,] 3.426814e-229 6.597015e-09 1.000000e+00
## [145,] 2.011574e-232 2.620636e-10 1.000000e+00
## [146,] 1.078519e-187 7.915543e-07 9.999992e-01
## [147,] 1.061392e-146 2.770575e-02 9.722942e-01
## [148,] 1.846900e-164 4.398402e-04 9.995602e-01
## [149,] 1.439996e-195 3.384156e-07 9.999997e-01
## [150,] 2.771480e-143 5.987903e-02 9.401210e-01
#CONFUSION MATRIX
out = table(predict(res,iris[,1:4]),iris[,5])
out
##             
##              setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         47         3
##   virginica       0          3        47

Support Vector Machines (SVM) - 1

The goal of the SVM is to map a set of entities with inputs \(X=\{x_1,x_2,\ldots,x_n\}\) of dimension \(n\), i.e., \(X \in R^n\), into a set of categories \(Y=\{y_1,y_2,\ldots,y_m\}\) of dimension \(m\), such that the \(n\)-dimensional \(X\)-space is divided using hyperplanes, which result in the maximal separation between classes \(Y\). A hyperplane is the set of points \({\bf x}\) satisfying the equation

\[ {\bf w} \cdot {\bf x} = b \]

where \(b\) is a scalar constant, and \({\bf w} \in R^n\) is the normal vector to the hyperplane, i.e., the vector at right angles to the plane. The distance between this hyperplane and \({\bf w} \cdot {\bf x} = 0\) is given by \(|b|/||{\bf w}||\), where \(||{\bf w}||\) is the norm of vector \({\bf w}\).

SVM - 2

This setup is sufficient to provide intuition about how the SVM is implemented. Suppose we have two categories of data, i.e., \(y = \{y_1, y_2\}\). Assume that all points in category \(y_1\) lie above a hyperplane \({\bf w} \cdot {\bf x} = b_1\), and all points in category \(y_2\) lie below a hyperplane \({\bf w} \cdot {\bf x} = b_2\). Then the distance between the two hyperplanes is \(\frac{|b_1-b_2|}{||{\bf w}||}\).

#Example of hyperplane geometry
w1 = 1; w2 = 2
b1 = 10
#Plot hyperplane in x1, x2 space
x1 = seq(-3,3,0.1)
x2 = (b1-w1*x1)/w2
plot(x1,x2,type="l")
#Create hyperplane 2
b2 = 8
x2 = (b2-w1*x1)/w2
lines(x1,x2,col="red")

#Compute perpendicular distance between hyperplanes 1 and 2
print(abs(b1-b2)/sqrt(w1^2+w2^2))
## [1] 0.8944272

We see that this gives the perpendicular distance between the two parallel hyperplanes.

The goal of the SVM is to maximize the distance (separation) between the two hyperplanes. For a given gap \(|b_1-b_2|\), this is achieved by minimizing the norm \(||{\bf w}||\). This naturally leads to a quadratic optimization problem.

\[ \begin{equation} \min_{b_1,b_2,{\bf w}} \frac{1}{2} ||{\bf w}|| \end{equation} \]

subject to \({\bf w} \cdot {\bf x} \geq b_1\) for points in category \(y_1\) and \({\bf w} \cdot {\bf x} \leq b_2\) for points in category \(y_2\). Note that only the observations lying on the two bounding hyperplanes determine the solution. These are the “support” vectors, i.e., the program also finds the minimal set of points that separate the two groups. The “half” in front of the minimand is for mathematical convenience: in practice one minimizes the equivalent objective \(\frac{1}{2}||{\bf w}||^2\), whose gradient is simply \({\bf w}\), and this is what makes the problem a quadratic program.

SVM - 3

Of course, there may be no linear hyperplane that perfectly separates the two groups. This slippage may be accounted for in the SVM by allowing for points on the wrong side of the separating hyperplanes using cost functions, i.e., we modify the quadratic program as follows:

\[ \begin{equation} \min_{b_1,b_2,{\bf w},\{\eta_i\}} \frac{1}{2} ||{\bf w}|| + C_1 \sum_{i \in y_1} \eta_i + C_2 \sum_{i \in y_2} \eta_i \end{equation} \]

where \(C_1,C_2\) are the costs for slippage in groups 1 and 2, respectively, and each sum runs over the observations in the corresponding group. Implementations often assume \(C_1=C_2\). The values \(\eta_i\) are positive for observations that are not perfectly separated, i.e., that lead to slippage. For group 1, \(\eta_i\) is the amount by which observation \(i\) falls below the hyperplane \({\bf w} \cdot {\bf x} = b_1\), i.e., the observation lies on the hyperplane \({\bf w} \cdot {\bf x} = b_1 - \eta_i\). For group 2, \(\eta_i\) is the amount by which observation \(i\) lies above the hyperplane \({\bf w} \cdot {\bf x} = b_2\), i.e., the observation lies on the hyperplane \({\bf w} \cdot {\bf x} = b_2 + \eta_i\). In both cases the perpendicular distance to the bounding hyperplane is \(\eta_i/||{\bf w}||\). For observations on the correct side of their respective hyperplanes, of course, \(\eta_i=0\).
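To make the slippage terms concrete, here is a minimal sketch in base R that evaluates the penalized objective for a fixed, hand-picked \({\bf w}, b_1, b_2\). The four data points and the common cost `C` are made up for illustration, and nothing is optimized here: the code only shows how each \(\eta_i\) is measured.

```r
# Hypothetical hyperplanes (same w, b1, b2 as the geometry example) and toy data
w  = c(1, 2); b1 = 10; b2 = 8
C  = 1                                  # assumed common cost, i.e., C1 = C2
X1 = rbind(c(4, 4), c(2, 3.5))          # group 1: should satisfy w.x >= b1
X2 = rbind(c(1, 2), c(3, 3))            # group 2: should satisfy w.x <= b2

# Slippage: shortfall below b1 for group 1, excess above b2 for group 2
eta1 = pmax(0, b1 - X1 %*% w)
eta2 = pmax(0, X2 %*% w - b2)

# Objective as written in the text: half the norm plus the slippage costs
objective = 0.5 * sqrt(sum(w^2)) + C * sum(eta1) + C * sum(eta2)
print(cbind(eta1, eta2))
print(objective)
```

In this toy setting one point in each group is on the wrong side of its hyperplane by exactly one unit of \({\bf w} \cdot {\bf x}\), so each group contributes a slippage cost of `C`.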

Example of SVM with Confusion Matrix

library(e1071)

#EXAMPLE 1 for SVM
model = svm(iris[,1:4],iris[,5])
model
## 
## Call:
## svm.default(x = iris[, 1:4], y = iris[, 5])
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.25 
## 
## Number of Support Vectors:  51
out = predict(model,iris[,1:4])
out
##          1          2          3          4          5          6 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##          7          8          9         10         11         12 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##         13         14         15         16         17         18 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##         19         20         21         22         23         24 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##         25         26         27         28         29         30 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##         31         32         33         34         35         36 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##         37         38         39         40         41         42 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##         43         44         45         46         47         48 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##         49         50         51         52         53         54 
##     setosa     setosa versicolor versicolor versicolor versicolor 
##         55         56         57         58         59         60 
## versicolor versicolor versicolor versicolor versicolor versicolor 
##         61         62         63         64         65         66 
## versicolor versicolor versicolor versicolor versicolor versicolor 
##         67         68         69         70         71         72 
## versicolor versicolor versicolor versicolor versicolor versicolor 
##         73         74         75         76         77         78 
## versicolor versicolor versicolor versicolor versicolor  virginica 
##         79         80         81         82         83         84 
## versicolor versicolor versicolor versicolor versicolor  virginica 
##         85         86         87         88         89         90 
## versicolor versicolor versicolor versicolor versicolor versicolor 
##         91         92         93         94         95         96 
## versicolor versicolor versicolor versicolor versicolor versicolor 
##         97         98         99        100        101        102 
## versicolor versicolor versicolor versicolor  virginica  virginica 
##        103        104        105        106        107        108 
##  virginica  virginica  virginica  virginica  virginica  virginica 
##        109        110        111        112        113        114 
##  virginica  virginica  virginica  virginica  virginica  virginica 
##        115        116        117        118        119        120 
##  virginica  virginica  virginica  virginica  virginica versicolor 
##        121        122        123        124        125        126 
##  virginica  virginica  virginica  virginica  virginica  virginica 
##        127        128        129        130        131        132 
##  virginica  virginica  virginica  virginica  virginica  virginica 
##        133        134        135        136        137        138 
##  virginica versicolor  virginica  virginica  virginica  virginica 
##        139        140        141        142        143        144 
##  virginica  virginica  virginica  virginica  virginica  virginica 
##        145        146        147        148        149        150 
##  virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica
print(length(out))
## [1] 150
table(matrix(out),iris[,5])
##             
##              setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         48         2
##   virginica       0          2        48

So, in-sample, the SVM does marginally better than naive Bayes (146 versus 144 of 150 correctly classified). Here is another example.

Another example

#EXAMPLE 2 for SVM
train_data = matrix(rpois(60,3),10,6)
print(train_data)
##       [,1] [,2] [,3] [,4] [,5] [,6]
##  [1,]    4    2    4    2    3    3
##  [2,]    1    3    4    1    2    3
##  [3,]    4    0    5    4    0    3
##  [4,]    3    5    7    4    4    5
##  [5,]    4    4    1    2    5    2
##  [6,]    2    2    2    4    2    4
##  [7,]    4    4    7    3    0    2
##  [8,]    1    6    3    4    4    4
##  [9,]    2    3    0    3    5    2
## [10,]    4    5    4    2    2    5
train_class = as.matrix(c(2,3,1,2,2,1,3,2,3,3))
print(train_class)
##       [,1]
##  [1,]    2
##  [2,]    3
##  [3,]    1
##  [4,]    2
##  [5,]    2
##  [6,]    1
##  [7,]    3
##  [8,]    2
##  [9,]    3
## [10,]    3
library(e1071)
model = svm(train_data,train_class)
model
## 
## Call:
## svm.default(x = train_data, y = train_class)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.1666667 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  10
pred = predict(model,train_data, type="raw")
table(pred,train_class)
##                   train_class
## pred               1 2 3
##   1.49183455115297 1 0 0
##   1.57907426923968 1 0 0
##   2.07880384949001 0 1 0
##   2.07882663762585 0 1 0
##   2.07897845681741 0 1 0
##   2.07899254666812 0 1 0
##   2.56858795453682 0 0 1
##   2.82067612740427 0 0 1
##   2.83375866541262 0 0 1
##   2.92104171473576 0 0 1
train_fitted = round(pred,0)
print(cbind(train_class,train_fitted))
##      train_fitted
## 1  2            2
## 2  3            3
## 3  1            1
## 4  2            2
## 5  2            2
## 6  1            2
## 7  3            3
## 8  2            2
## 9  3            3
## 10 3            3
train_fitted = matrix(train_fitted)
table(train_class,train_fitted)
##            train_fitted
## train_class 1 2 3
##           1 1 1 0
##           2 0 4 0
##           3 0 0 4
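Because `train_class` is numeric, `svm()` above fits an eps-regression and the fitted values have to be rounded to recover classes. A minimal sketch of the alternative, assuming the e1071 package is installed: passing the labels as a factor makes `svm()` fit a classification model, so `predict()` returns class labels directly. The seed and simulated data are illustrative and will not reproduce the numbers shown above.

```r
library(e1071)

set.seed(42)                                   # illustrative seed, not from the original run
train_data  = matrix(rpois(60, 3), 10, 6)
train_class = factor(c(2,3,1,2,2,1,3,2,3,3))   # factor response => classification

model = svm(train_data, train_class)
pred  = predict(model, train_data)             # returns a factor of predicted classes

# Confusion matrix: predicted class (rows) against true class (columns)
conf = table(pred, train_class)
print(conf)
```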

Statistical Significance of the Confusion Matrix

How do we know if the confusion matrix shows statistically significant classification power? We do a chi-square test.

library(e1071)
res = naiveBayes(iris[,1:4],iris[,5])
pred = predict(res,iris[,1:4])
out = table(pred,iris[,5])
out
##             
## pred         setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         47         3
##   virginica       0          3        47
chisq.test(out)
## 
##  Pearson's Chi-squared test
## 
## data:  out
## X-squared = 266.16, df = 4, p-value < 2.2e-16

Word count classifiers, adjectives, and adverbs

  1. Given a lexicon of selected words, one may sign the words as positive or negative, and then do a simple word count to compute the net sentiment or mood of the text. By establishing appropriate cutoffs, one can classify text as optimistic, neutral, or pessimistic. These cutoffs are determined using the training and testing data sets.

  2. Word count classifiers may be enhanced by focusing on “emphasis words” such as adjectives and adverbs, especially when classifying emotive content. One approach used in Das and Chen (2007) is to identify all adjectives and adverbs in the text and then consider only the words within three words on either side of each adjective or adverb. This extracts only the most emphatic parts of the text, which are then scored for mood.
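A word count classifier of the kind described in item 1 can be sketched in a few lines of base R. The two tiny lexicons and the cutoff of 1 below are hypothetical placeholders. In practice the lexicon would be much larger and the cutoffs would be tuned on training data.

```r
# Hypothetical mini-lexicons (illustration only)
pos_lex = c("gain", "strong", "bullish", "upgrade")
neg_lex = c("loss", "weak", "bearish", "downgrade")

# Net sentiment = positive word count minus negative word count,
# then mapped to a label using a symmetric cutoff
score_text = function(txt, cutoff = 1) {
  words = strsplit(tolower(txt), "[^a-z]+")[[1]]
  net = sum(words %in% pos_lex) - sum(words %in% neg_lex)
  label = if (net >= cutoff) "optimistic" else if (net <= -cutoff) "pessimistic" else "neutral"
  list(net = net, label = label)
}

r1 = score_text("Strong gain after upgrade")
r2 = score_text("Weak quarter, loss widens")
r3 = score_text("Flat day for the index")
print(c(r1$label, r2$label, r3$label))
```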

Fisher’s discriminant

\[ \begin{equation} F(w) = \frac{\frac{1}{K} \sum_{j=1}^K ({\bar w}_j - {\bar w}_0)^2}{\frac{1}{K} \sum_{j=1}^K \sigma_j^2} \nonumber \end{equation} \]

where \(K\) is the number of categories, \({\bar w}_j\) is the mean occurrence of the word \(w\) per text in category \(j\), \({\bar w}_0\) is the mean occurrence across all categories, and \(\sigma_j^2\) is the variance of the word occurrence in category \(j\). A word with a high \(F(w)\) varies more across categories than within them, and is therefore a better discriminant. This is just one way in which Fisher’s discriminant may be calculated, and there are other variations on the theme.
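The formula translates directly into base R. This is a sketch under two assumptions that the text leaves open: \({\bar w}_0\) is taken as the mean over all documents, and \(\sigma_j^2\) is the sample variance returned by `var()`. The counts and category labels are made up.

```r
# counts: occurrences of one word in each document; category: document labels
fisher_score = function(counts, category) {
  wbar_j = tapply(counts, category, mean)   # mean occurrence within each category
  var_j  = tapply(counts, category, var)    # sample variance within each category
  wbar_0 = mean(counts)                     # assumed: mean over all documents
  mean((wbar_j - wbar_0)^2) / mean(var_j)   # between-category over within-category
}

# Toy data: the word appears more often in category "a" than in "b"
counts   = c(3, 4, 5, 0, 1, 2)
category = c("a", "a", "a", "b", "b", "b")
F_w = fisher_score(counts, category)
print(F_w)
```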

Vector-Distance Classifier

Suppose we have 500 documents in each of two categories, bullish and bearish. These 1,000 documents may all be placed as points in \(n\)-dimensional space. It is more than likely that the points in each category will lie closer to each other than to the points in the other category. Now, if we wish to classify a new document, with vector \(D_i\), the obvious idea is to look at which cluster it is closest to, or which point in either cluster it is closest to. The closeness between two documents \(i\) and \(j\) is determined easily by the well known metric of cosine distance, i.e.,

\[ \begin{equation} 1 - \cos(\theta_{ij}) = 1 - \frac{D_i^\top D_j}{||D_i|| \cdot ||D_j||} \nonumber \end{equation} \]

where \(||D_i|| = \sqrt{D_i^\top D_i}\) is the norm of the vector \(D_i\). The cosine of the angle between the two document vectors is 1 if the two vectors are identical (or, more generally, point in the same direction), and in this case the distance between them is zero.
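As a small illustration, the sketch below computes cosine distance in base R and assigns a new document to the closer of two cluster centroids. The three-term document vectors are invented, and comparing against centroids is only one of the two options mentioned above (the other being the nearest single point).

```r
# Cosine distance between two term vectors
cosine_distance = function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy term-count vectors (rows are documents) for each category
bull = rbind(c(3, 0, 1), c(4, 1, 0))
bear = rbind(c(0, 3, 1), c(1, 4, 0))
new_doc = c(2, 0, 1)

# Distance from the new document to each category centroid
d_bull = cosine_distance(new_doc, colMeans(bull))
d_bear = cosine_distance(new_doc, colMeans(bear))
label  = if (d_bull < d_bear) "bullish" else "bearish"
print(c(d_bull = d_bull, d_bear = d_bear))
print(label)
```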

Metrics: Confusion matrix

The confusion matrix is the classic tool for assessing classification accuracy. Given \(n\) categories, the matrix is of dimension \(n \times n\). The rows relate to the category assigned by the analytic algorithm and the columns refer to the correct category in which the text resides. Each cell \((i,j)\) of the matrix contains the number of text messages that were of type \(j\) and were classified as type \(i\). The cells on the diagonal of the confusion matrix state the number of times the algorithm got the classification right. All other cells are instances of classification error. If an algorithm has no classification ability, then the rows and columns of the matrix will be independent of each other. Under this null hypothesis, the statistic that is examined for rejection is as follows:

\[ \chi^2[dof=(n-1)^2] = \sum_{i=1}^n \sum_{j=1}^n \frac{[A(i,j) - E(i,j)]^2}{E(i,j)} \]

where \(A(i,j)\) are the actual numbers observed in the confusion matrix, and \(E(i,j)\) are the expected numbers, assuming no classification ability under the null. If \(T(i)\) represents the total across row \(i\) of the confusion matrix, and \(T(j)\) the column total, then

\[ E(i,j) = \frac{T(i) \times T(j)}{\sum_{i=1}^n T(i)} \equiv \frac{T(i) \times T(j)}{\sum_{j=1}^n T(j)} \]

The degrees of freedom of the \(\chi^2\) statistic is \((n-1)^2\). This statistic is very easy to implement and may be applied to models for any \(n\). A highly significant statistic is evidence of classification ability.
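The statistic can be computed by hand from these formulas. The base R sketch below applies them to the naive Bayes confusion matrix for the iris data shown earlier, and should agree with the `chisq.test()` output reported there. The variable names are illustrative.

```r
# Naive Bayes confusion matrix for iris (rows: predicted, columns: actual)
A = matrix(c(50,  0,  0,
              0, 47,  3,
              0,  3, 47), 3, 3, byrow = TRUE)

# Expected counts under independence of rows and columns
row_tot = rowSums(A)
col_tot = colSums(A)
E = outer(row_tot, col_tot) / sum(A)

# Chi-square statistic, degrees of freedom, and p-value
chi2 = sum((A - E)^2 / E)
dof  = (nrow(A) - 1)^2
pval = pchisq(chi2, dof, lower.tail = FALSE)
print(c(chi2 = chi2, dof = dof, pval = pval))
```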

Accuracy

Algorithm accuracy over a classification scheme is the percentage of text that is correctly classified. This may be done in-sample or out-of-sample. To compute this off the confusion matrix, we calculate

\[ \mbox{Accuracy} = \frac{ \sum_{i=1}^K O(i,i)}{\sum_{j=1}^K M(j)} = \frac{ \sum_{i=1}^K O(i,i)}{\sum_{i=1}^K M(i)} \]

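Here \(O(i,j)\) denotes the entries of the \(K \times K\) confusion matrix, and \(M(j)\) and \(M(i)\) are its column and row totals, both of which sum to the total number of texts classified. We should hope that accuracy is at least greater than \(1/K\), which is the accuracy level achieved on average from random guessing.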
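In R, accuracy is the sum of the diagonal of the confusion matrix divided by the sum of all its entries. A minimal sketch using the naive Bayes confusion matrix for the iris data from earlier:

```r
# Naive Bayes confusion matrix for iris (rows: predicted, columns: actual)
O = matrix(c(50,  0,  0,
              0, 47,  3,
              0,  3, 47), 3, 3, byrow = TRUE)
K = nrow(O)

accuracy = sum(diag(O)) / sum(O)   # 144 correct out of 150
print(accuracy)
print(accuracy > 1/K)              # compare with the random-guessing benchmark
```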

Sentiment over Time

Stock Sentiment Correlations

Phase Lag Analysis

False Positives

  1. The percentage of false positives is a useful metric to work with. It may be calculated as a simple count or as a weighted count (by nearness of wrong category) of false classifications divided by total classifications undertaken.

  2. For example, assume that in the confusion matrix below, category 1 is BULLISH and category 3 is BEARISH, whereas category 2 is NEUTRAL. The false positives would arise from misclassifying category 1 as 3 and vice versa. We compute the false positive rate for illustration.

  3. The false positive rate is just 1% in the example below.

Omatrix = matrix(c(22,1,0,3,44,3,1,1,25),3,3)
print((Omatrix[1,3]+Omatrix[3,1])/sum(Omatrix))
## [1] 0.01

Sentiment Error

In a 3-way classification scheme, where category 1 is BULLISH and category 3 is BEARISH, whereas category 2 is NEUTRAL, we can compute this metric as follows.

\[ \begin{equation} \mbox{Sentiment Error} = 1 - \frac{M(j=1)-M(j=3)}{M(i=1)-M(i=3)} \nonumber \end{equation} \]

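In our illustrative example, we may easily calculate this metric. Here the row totals are the true category counts and the column totals are the classified counts. The classified net sentiment from the algorithm was \(-4 = 23-27\), whereas it actually should have been \(-2 = 26-28\). The sentiment error is therefore \(1 - (-4)/(-2) = -1\), i.e., 100% in magnitude: the algorithm reports twice the net bearishness that is actually present.

print(Omatrix)
##      [,1] [,2] [,3]
## [1,]   22    3    1
## [2,]    1   44    1
## [3,]    0    3   25
rsum = rowSums(Omatrix)
csum = colSums(Omatrix)
print(rsum)
## [1] 26 46 28
print(csum)
## [1] 23 50 27
print(1 - (csum[1]-csum[3])/(rsum[1]-rsum[3]))
## [1] -1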

Disagreement

The metric uses the number of signed buys and sells in the day (based on a sentiment model) to determine how much difference of opinion there is in the market. The metric is computed as follows:

\[ \mbox{DISAG} = \left| 1 - \left| \frac{B-S}{B+S} \right| \right| \]

where \(B, S\) are the numbers of classified buys and sells. Note that DISAG is bounded between zero and one.

Using the true categories of buys (category 1 BULLISH) and sells (category 3 BEARISH) in the same example as before, we may compute disagreement. Since buys and sells are almost evenly balanced (26 buys and 28 sells), disagreement is high.

print(Omatrix)
##      [,1] [,2] [,3]
## [1,]   22    3    1
## [2,]    1   44    1
## [3,]    0    3   25
DISAG = abs(1-abs((26-28)/(26+28)))
print(DISAG)
## [1] 0.962963

Precision and Recall

The creation of the confusion matrix leads naturally to two measures that are associated with it.

Precision is the fraction of positives identified that are truly positive, and is also known as positive predictive value. It is a measure of usefulness of prediction. So if the algorithm (say) was tasked with selecting those account holders on LinkedIn who are actually looking for a job, and it identifies \(n\) such people of which only \(m\) were really looking for a job, then the precision would be \(m/n\).

Recall is the proportion of positives that are correctly identified, and is also known as sensitivity. It is a measure of how complete the prediction is. If the actual number of people looking for a job on LinkedIn was \(M\), then recall would be \(m/M\).

For example, suppose we have the following confusion matrix.

                          Actual
Predicted          Looking for Job   Not Looking   Total
Looking for Job                 10             2      12
Not Looking                      1            16      17
Total                           11            18      29

In this case precision is \(10/12\) and recall is \(10/11\). One minus precision (\(2/12\)) is the share of predicted positives that are false positives, so precision is related to Type I error. One minus recall (\(1/11\)) is the share of actual positives that are missed, i.e., the rate of false negatives (Type II error).
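These quantities are easy to read off the confusion matrix in R. The sketch below uses the job-search table above. The F-score line is an addition not discussed in the text: it is the harmonic mean of precision and recall, a common single-number summary of the two.

```r
# Confusion matrix from the example (rows: predicted, columns: actual)
cm = matrix(c(10,  2,
               1, 16), 2, 2, byrow = TRUE,
            dimnames = list(Predicted = c("Looking", "Not looking"),
                            Actual    = c("Looking", "Not looking")))

TP = cm[1, 1]   # predicted looking, actually looking
FP = cm[1, 2]   # predicted looking, actually not looking
FN = cm[2, 1]   # predicted not looking, actually looking

precision = TP / (TP + FP)   # 10/12
recall    = TP / (TP + FN)   # 10/11
f_score   = 2 * precision * recall / (precision + recall)
print(c(precision = precision, recall = recall, f_score = f_score))
```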

Using the RTextTools package

The RTextTools package bundles several text classification algorithms behind a single interface.

library(tm)
library(RTextTools)
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
## 
## Attaching package: 'RTextTools'
## The following objects are masked from 'package:SnowballC':
## 
##     getStemLanguages, wordStem
#Create sample text with positive and negative markers
n = 1000
npos = round(runif(n,1,25))
nneg = round(runif(n,1,25))
flag = matrix(0,n,1)
flag[which(npos>nneg)] = 1
text = NULL
for (j in 1:n) {
  res = paste(c(sample(poswords,npos[j]),sample(negwords,nneg[j])),collapse=" ")
  text = c(text,res)
}

#Text Classification
m = create_matrix(text)
print(m)
## <<DocumentTermMatrix (documents: 1000, terms: 3707)>>
## Non-/sparse entries: 25749/3681251
## Sparsity           : 99%
## Maximal term length: 17
## Weighting          : term frequency (tf)
m = create_matrix(text,weighting=weightTfIdf)
print(m)
## <<DocumentTermMatrix (documents: 1000, terms: 3707)>>
## Non-/sparse entries: 25749/3681251
## Sparsity           : 99%
## Maximal term length: 17
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
container <- create_container(m,flag,trainSize=1:(n/2), testSize=(n/2+1):n,virgin=FALSE)
#models <- train_models(container, algorithms=c("MAXENT","SVM","GLMNET","SLDA","TREE","BAGGING","BOOSTING","RF"))
models <- train_models(container, algorithms=c("MAXENT","SVM","GLMNET","TREE"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)

#RESULTS
analytics@algorithm_summary # SUMMARY OF PRECISION, RECALL, F-SCORES, AND ACCURACY SORTED BY TOPIC CODE FOR EACH ALGORITHM
##   SVM_PRECISION SVM_RECALL SVM_FSCORE GLMNET_PRECISION GLMNET_RECALL
## 0          0.79       0.74       0.76             0.59          0.81
## 1          0.75       0.81       0.78             0.70          0.44
##   GLMNET_FSCORE TREE_PRECISION TREE_RECALL TREE_FSCORE
## 0          0.68           0.51        0.91        0.65
## 1          0.54           0.57        0.12        0.20
##   MAXENTROPY_PRECISION MAXENTROPY_RECALL MAXENTROPY_FSCORE
## 0                 0.74              0.78              0.76
## 1                 0.76              0.73              0.74
analytics@label_summary # SUMMARY OF LABEL (e.g. TOPIC) ACCURACY
##   NUM_MANUALLY_CODED NUM_CONSENSUS_CODED NUM_PROBABILITY_CODED
## 0                251                 386                   274
## 1                249                 114                   226
##   PCT_CONSENSUS_CODED PCT_PROBABILITY_CODED PCT_CORRECTLY_CODED_CONSENSUS
## 0           153.78486             109.16335                      93.22709
## 1            45.78313              90.76305                      38.95582
##   PCT_CORRECTLY_CODED_PROBABILITY
## 0                        78.08765
## 1                        68.67470
analytics@document_summary # RAW SUMMARY OF ALL DATA AND SCORING
##     MAXENTROPY_LABEL MAXENTROPY_PROB SVM_LABEL  SVM_PROB GLMNET_LABEL
## 1                  1       0.8128985         1 0.7477271            1
## 2                  0       0.9994029         0 0.9575503            0
## 3                  1       0.8841554         1 0.8518500            0
## 4                  1       0.9944042         1 0.7407198            1
## 5                  1       0.9670682         1 0.7232802            1
## 6                  1       0.9989204         1 0.9326247            0
## 7                  0       0.9922064         0 0.7732143            0
## 8                  1       0.9654615         1 0.8032935            0
## 9                  1       0.9898728         1 0.8705422            1
## 10                 1       0.6596024         0 0.5437161            1
## 11                 1       0.6374276         0 0.5637331            1
## 12                 1       0.8088413         1 0.8034760            0
## 13                 1       0.9687153         1 0.8498515            0
## 14                 0       0.9703313         0 0.5937378            1
## 15                 0       0.5106975         0 0.5214603            0
## 16                 0       0.9959652         0 0.9277749            0
## 17                 1       0.9898641         1 0.5760921            0
## 18                 1       0.7994957         1 0.5226133            0
## 19                 1       0.9923767         1 0.7558341            0
## 20                 0       0.9963374         0 0.8710885            0
## 21                 1       0.8986928         1 0.7234946            1
## 22                 1       0.8337704         1 0.6618261            0
## 23                 0       0.9999773         0 0.9796217            1
## 24                 1       0.9701363         1 0.8062617            0
## 25                 0       0.9561668         0 0.8228493            0
## 26                 0       0.9920456         0 0.7782786            0
## 27                 1       0.9573274         1 0.6142472            0
## 28                 1       0.9675582         1 0.7962569            0
## 29                 1       0.9993184         1 0.8614660            0
## 30                 0       0.9877486         0 0.8463667            0
## 31                 0       0.8736243         0 0.7611957            0
## 32                 0       0.6344662         0 0.5223552            0
## 33                 1       0.9989904         1 0.9444935            1
## 34                 1       0.9980179         1 0.9297618            1
## 35                 0       0.9934277         0 0.9172730            1
## 36                 0       0.9929164         0 0.8672195            0
## 37                 0       0.9858067         0 0.8684471            0
## 38                 0       0.9967703         0 0.8814079            0
## 39                 0       0.9987181         0 0.9724483            0
## 40                 1       0.9839251         1 0.9355922            1
## 41                 1       0.7374481         1 0.7822381            1
## 42                 0       0.8551329         0 0.7298194            0
## 43                 1       0.7599289         1 0.5239015            0
## 44                 0       0.6806661         1 0.5000000            0
## 45                 1       0.9997091         1 0.9406043            1
## 46                 0       0.7549629         1 0.5331721            0
## 47                 1       0.9693582         1 0.9396171            1
## 48                 0       0.9939514         0 0.8756046            0
## 49                 0       0.8388264         0 0.7082167            0
## 50                 1       0.9970600         1 0.8671576            1
## 51                 1       0.9428984         1 0.7280938            0
## 52                 0       0.9997440         0 0.9665184            0
## 53                 0       0.9762014         0 0.7202795            0
## 54                 0       0.9985348         0 0.9237987            0
## 55                 0       0.8885191         0 0.6259608            0
## 56                 1       0.9445026         1 0.6707804            1
## 57                 0       0.9327551         1 0.5000000            1
## 58                 0       0.7968503         0 0.6860803            0
## 59                 1       0.9997488         1 0.9292209            1
## 60                 0       0.9301424         0 0.6671674            1
## 61                 0       0.6808879         0 0.5160386            0
## 62                 1       0.6656949         1 0.6806827            0
## 63                 1       0.9051844         1 0.5257806            1
## 64                 0       0.9932167         0 0.8035601            1
## 65                 0       0.5213249         1 0.7858596            0
## 66                 1       0.5357247         1 0.5223799            1
## 67                 0       0.9995241         0 0.9600918            0
## 68                 1       0.8019841         1 0.8494818            1
## 69                 1       0.9995614         1 0.8958818            0
## 70                 1       0.7446780         0 0.5194801            1
## 71                 1       0.7716160         1 0.7010697            0
## 72                 0       0.8523766         0 0.6991312            0
## 73                 0       0.9253682         0 0.7925627            0
## 74                 1       0.8781787         0 0.5720535            1
## 75                 1       0.9731435         1 0.8918941            1
## 76                 0       0.9965196         1 0.5138448            0
## 77                 0       0.9790316         0 0.7325714            0
## 78                 1       0.8467690         0 0.6997145            0
## 79                 1       0.5240234         0 0.5265153            0
## 80                 1       0.9956678         1 0.8205247            1
## 81                 0       0.9827347         0 0.8948757            0
## 82                 0       0.9998397         0 0.9606773            0
## 83                 0       0.9210204         0 0.8301172            0
## 84                 0       0.8898177         0 0.6360776            0
## 85                 0       0.6012967         0 0.6719716            1
## 86                 0       0.9268625         0 0.5774946            0
## 87                 0       0.8998142         1 0.5680135            0
## 88                 0       0.9665359         0 0.9250393            0
## 89                 0       0.9950601         0 0.8068803            0
## 90                 1       0.7325425         0 0.7276670            0
## 91                 0       0.5471800         1 0.5786405            1
## 92                 1       0.9427613         1 0.7111441            1
## 93                 0       0.9984813         0 0.9210619            0
## 94                 0       0.9903869         0 0.8971490            0
## 95                 1       0.9999986         1 0.9864772            1
## 96                 1       0.8145151         1 0.6606432            0
## 97                 1       0.7658036         1 0.6049924            0
## 98                 1       0.9182132         1 0.6880265            1
## 99                 1       0.9981160         1 0.9069236            0
## 100                0       0.7993881         1 0.6131009            1
## 101                1       0.9565344         1 0.7490043            1
## 102                1       0.6650576         1 0.6304966            0
## 103                0       0.5063549         1 0.5000000            0
## 104                0       0.7260377         0 0.6220326            0
## 105                1       0.8296563         1 0.5697256            0
## 106                0       0.8581268         0 0.6725471            0
## 107                1       0.9005181         1 0.8974063            0
## 108                0       0.5590510         1 0.5178110            0
## 109                0       0.9022235         0 0.7564028            1
## 110                0       0.7482033         0 0.6149431            0
## 111                1       0.9988620         1 0.9427812            1
## 112                1       0.9689348         0 0.5129173            0
## 113                1       0.8953959         1 0.6481548            1
## 114                0       0.9968940         0 0.9432780            0
## 115                1       0.6460545         1 0.6692391            0
## 116                0       0.9833456         0 0.6299335            1
## 117                0       0.9118761         1 0.8604302            1
## 118                1       0.9998324         1 0.9078153            0
## 119                0       0.5535702         1 0.5821412            1
## 120                0       0.5827588         0 0.5514501            0
## 121                1       0.9999944         1 0.9867531            1
## 122                0       0.6537523         1 0.7163307            1
## 123                1       0.9972034         1 0.7220134            0
## 124                1       0.9995296         1 0.9391763            1
## 125                1       0.8322974         1 0.6153373            0
## 126                0       0.9999949         0 0.9694848            0
## 127                1       0.9990772         1 0.8621969            1
## 128                1       0.9386510         1 0.8673417            1
## 129                1       0.9903196         1 0.8724989            1
## 130                0       0.7623796         0 0.5543861            0
## 131                0       0.9958236         0 0.8777639            0
## 132                1       0.8719154         1 0.6229240            0
## 133                1       0.9941677         1 0.7983631            0
## 134                0       0.9995906         0 0.9567863            0
## 135                1       0.9954453         1 0.9550945            1
## 136                0       0.5013833         1 0.5464660            1
## 137                1       0.9880241         1 0.7920550            1
## 138                0       0.9999979         0 0.9881448            0
## 139                1       0.9301504         1 0.8376803            1
## 140                1       0.9857629         1 0.8025009            0
## 141                1       0.9683327         1 0.5957441            1
## 142                1       0.9942656         1 0.9179661            1
## 143                0       0.7035816         0 0.7609529            0
## 144                0       0.7589324         1 0.5155630            0
## 145                0       0.5777178         0 0.6374749            0
## 146                0       0.6462308         0 0.5474078            0
## 147                1       0.9948484         1 0.9112858            0
## 148                1       0.9665183         1 0.7919915            0
## 149                0       0.9362211         0 0.6588428            0
## 150                1       0.9944232         1 0.9073412            1
## 151                1       0.9999235         1 0.8906657            1
## 152                0       0.9750741         0 0.8250910            0
## 153                0       0.8880219         0 0.7826170            0
## 154                0       0.9999308         0 0.9456314            0
## 155                0       0.9983554         0 0.8579591            0
## 156                1       0.9998990         1 0.8308149            0
## 157                0       0.6961788         0 0.5544803            0
## 158                1       0.9998276         1 0.8908183            0
## 159                0       0.8122799         0 0.7275137            0
## 160                1       0.9726847         1 0.6989201            1
## 161                1       0.7547728         1 0.7729913            1
## 162                0       0.6232491         0 0.6370918            0
## 163                1       0.7877234         0 0.5131600            1
## 164                0       0.7855512         0 0.7253842            0
## 165                0       0.5159015         0 0.7814032            0
## 166                0       0.5092280         1 0.5790079            1
## 167                0       0.9985181         0 0.7838887            0
## 168                1       0.9769882         1 0.6934495            0
## 169                0       0.9649751         0 0.7240176            0
## 170                1       0.6880233         0 0.5088707            0
## 171                0       0.5050760         0 0.5577567            0
## 172                0       0.5934099         1 0.5561894            0
## 173                1       0.5051749         1 0.7499614            1
## 174                1       0.7045128         1 0.7031793            1
## 175                1       0.7394669         1 0.6924005            0
## 176                1       0.5865656         1 0.5277826            0
## 177                1       0.5917299         1 0.5314588            0
## 178                0       0.7543881         0 0.7137238            0
## 179                0       0.9986024         0 0.7686486            0
## 180                1       0.9998012         1 0.9354996            1
## 181                0       0.9176487         0 0.6993425            0
## 182                1       0.8291850         1 0.6422053            0
## 183                1       0.9486831         1 0.7425172            0
## 184                1       0.9984585         1 0.9615689            1
## 185                1       0.9986400         1 0.9372804            1
## 186                1       0.9977648         1 0.8583378            0
## 187                0       0.5359480         1 0.7164408            0
## 188                1       0.5590231         0 0.5209708            1
## 189                1       0.7251138         1 0.6091527            0
## 190                0       0.9919851         0 0.7545487            0
## 191                0       0.5271147         1 0.5157553            0
## 192                0       0.7944627         1 0.7349025            0
## 193                1       0.7671650         1 0.5508305            0
## 194                0       0.8515744         0 0.8107470            0
## 195                0       0.9971905         0 0.8716436            1
## 196                0       0.9781120         0 0.8320561            0
## 197                0       0.9336950         0 0.7924070            0
## 198                0       0.7407429         1 0.5000000            0
## 199                1       0.9999984         1 0.6007777            0
## 200                0       0.9990395         0 0.9482064            0
## 201                0       0.8238369         1 0.5063313            0
## 202                1       0.9070229         1 0.7586176            0
## 203                0       0.7254005         0 0.5354525            1
## 204                0       0.9224864         0 0.7508186            0
## 205                1       0.9780501         1 0.5441811            1
## 206                0       0.9457504         0 0.7752804            0
## 207                1       0.7443015         0 0.5425703            0
## 208                1       0.9523549         1 0.7026285            0
## 209                1       0.5857516         1 0.5785889            0
## 210                1       0.7298929         1 0.7171452            0
## 211                0       0.7910843         1 0.5389055            1
## 212                1       0.9023352         1 0.6441911            1
## 213                0       0.9952811         1 0.5941013            1
## 214                0       0.9983226         0 0.8737786            0
## 215                0       0.9886532         0 0.6881907            0
## 216                0       0.9954326         0 0.8390559            1
## 217                0       0.9805979         0 0.8268154            0
## 218                0       0.9097496         0 0.6300567            0
## 219                0       0.9499024         1 0.5914116            0
## 220                0       0.9195491         0 0.6629643            0
## 221                1       0.9119327         1 0.8142266            1
## 222                0       0.8805880         0 0.5502538            1
## 223                1       0.9976208         1 0.8203757            1
## 224                0       0.8580226         0 0.5389772            1
## 225                0       0.9983346         0 0.9400183            0
## 226                1       0.8010072         1 0.6871148            1
## 227                1       0.7761497         1 0.5666177            1
## 228                0       0.9524410         0 0.7777861            0
## 229                0       0.7456341         0 0.5636089            0
## 230                1       0.9970708         1 0.7708723            0
## 231                1       0.7451311         1 0.6167894            1
## 232                0       0.7791542         0 0.7974689            0
## 233                1       0.8456182         1 0.7825672            1
## 234                0       0.8998767         0 0.7009498            0
## 235                0       0.9908314         0 0.5897571            1
## 236                0       0.9220584         0 0.5367924            0
## 237                1       0.9079610         1 0.6527462            0
## 238                1       0.9993306         1 0.9698138            1
## 239                1       0.9953971         1 0.6211661            0
## 240                0       0.9173502         0 0.6617049            0
## 241                0       0.5854528         0 0.6610598            0
## 242                0       0.9152434         0 0.5597828            0
## 243                0       0.9941803         0 0.7736273            0
## 244                0       0.9963055         0 0.8505750            0
## 245                0       0.9338779         1 0.5000000            0
## 246                1       0.7782416         1 0.5000000            0
## 247                1       0.9944806         1 0.9007876            1
## 248                1       0.9976997         1 0.8814903            0
## 249                1       0.9901030         1 0.7745626            0
## 250                1       0.9987814         1 0.7912278            1
## 251                1       0.5582029         1 0.7806575            1
## 252                0       0.9749167         0 0.8407736            0
## 253                1       0.9822757         1 0.7145851            0
## 254                0       0.9456951         0 0.7952853            0
## 255                0       0.7571824         0 0.5991020            1
## 256                0       0.5742905         0 0.6128120            1
## 257                0       0.9974032         0 0.8542493            0
## 258                1       0.9941373         1 0.7491310            0
## 259                1       0.9962501         1 0.8930351            1
## 260                1       0.9903032         1 0.5196341            0
## 261                1       0.7434228         1 0.6558614            0
## 262                0       0.9731966         0 0.8977182            0
## 263                1       0.9934960         1 0.8653191            0
## 264                0       0.6584605         0 0.6040838            0
## 265                1       0.9009149         1 0.7018958            0
## 266                0       0.7373388         0 0.5250890            0
## 267                1       0.6431493         1 0.6950449            0
## 268                1       0.7426433         1 0.6603771            1
## 269                0       0.9869195         0 0.8073168            0
## 270                1       0.9600692         1 0.7275844            0
## 271                0       0.9998658         0 0.9096190            1
## 272                0       0.9940796         0 0.8117161            0
## 273                0       0.9991737         0 0.9053893            0
## 274                0       0.9967507         0 0.8945669            0
## 275                0       0.9991585         0 0.9180828            1
## 276                0       0.9420480         0 0.7705584            0
## 277                0       0.9549072         0 0.5556725            0
## 278                0       0.9969435         0 0.8706034            0
## 279                1       0.9947791         1 0.8749368            1
## 280                0       0.8452477         0 0.6748101            0
## 281                0       0.7767490         1 0.6096169            0
## 282                0       0.9992423         0 0.9085480            0
## 283                0       0.9900504         0 0.7994949            0
## 284                1       0.9999201         1 0.9978600            1
## 285                0       0.6363413         0 0.5192603            0
## 286                0       0.7110731         1 0.6799780            1
## 287                0       0.9541139         0 0.8295179            0
## 288                1       0.9916998         1 0.7129795            1
## 289                0       0.9838223         0 0.7099468            1
## 290                0       0.9939969         0 0.9184316            0
## 291                1       0.9178187         1 0.5565172            0
## 292                0       0.9571169         0 0.6468213            1
## 293                1       0.7308985         1 0.5842116            0
## 294                0       0.7804228         1 0.5000000            0
## 295                1       0.5194434         1 0.8261068            1
## 296                0       0.8695216         0 0.7504121            1
## 297                0       0.9483468         1 0.5175318            0
## 298                1       0.8779710         1 0.8194471            0
## 299                0       0.5311403         1 0.5263268            0
## 300                1       0.9482938         1 0.5542939            0
## 301                1       0.8638649         1 0.6559819            0
## 302                1       0.9953375         1 0.8078920            1
## 303                1       0.9936536         1 0.6412526            0
## 304                1       0.9767917         1 0.9320293            1
## 305                1       0.9985347         1 0.9334050            1
## 306                1       0.8742846         1 0.7716551            0
## 307                0       0.8048584         0 0.7148120            0
## 308                0       0.8533203         0 0.5265383            1
## 309                1       0.7046129         0 0.5185688            0
## 310                0       0.5142722         1 0.6569303            1
## 311                0       0.9874550         0 0.9110249            0
## 312                1       0.6887485         0 0.5879564            0
## 313                0       0.8080974         0 0.7097439            0
## 314                1       0.6252458         0 0.5683850            1
## 315                1       0.9092329         1 0.5930706            0
## 316                0       0.6022597         1 0.5000000            1
## 317                1       0.7616674         1 0.6121351            0
## 318                1       0.9974232         1 0.8176693            0
## 319                1       0.9372081         1 0.7842353            0
## 320                1       0.5511488         1 0.6122980            0
## 321                0       0.9823572         0 0.6765989            1
## 322                0       0.9998854         0 0.9582100            0
## 323                1       0.5108931         1 0.5000000            1
## 324                1       0.9980767         1 0.9600171            1
## 325                1       0.9977295         1 0.9588532            1
## 326                1       0.6547592         0 0.5363830            1
## 327                1       0.9996889         1 0.8617034            0
## 328                1       0.9393837         1 0.7009936            0
## 329                0       0.6344353         0 0.5389189            0
## 330                0       0.7500277         0 0.8425403            1
## 331                1       0.6276758         1 0.5495079            0
## 332                0       0.9772530         0 0.6906334            0
## 333                1       0.9412288         1 0.7363577            0
## 334                0       0.9958098         0 0.9390684            0
## 335                1       0.7290147         1 0.6681969            0
## 336                1       0.8397825         1 0.7068088            1
## 337                0       0.9186125         0 0.6583635            0
## 338                0       0.6868231         0 0.5251544            0
## 339                1       0.9971474         1 0.8078827            0
## 340                0       0.6906884         1 0.5960133            0
## 341                1       0.9944000         1 0.7925128            1
## 342                0       0.7181427         0 0.5139650            0
## 343                1       0.9992761         1 0.9173717            0
## 344                1       0.9121243         1 0.7874143            1
## 345                0       0.8652695         0 0.7022282            0
## 346                0       0.5836982         1 0.5248983            0
## 347                0       0.9962145         0 0.8213261            1
## 348                0       0.9758256         0 0.7016412            0
## 349                1       0.9985352         1 0.9003568            1
## 350                1       0.9998321         1 0.9641759            1
## 351                0       0.9949653         0 0.8410647            0
## 352                0       0.9799601         0 0.8509359            0
## 353                0       0.6715760         0 0.5910754            0
## 354                0       0.9828706         0 0.8845325            0
## 355                1       0.5625961         1 0.5802290            0
## 356                0       0.8107950         1 0.5245733            0
## 357                0       0.9836870         0 0.7781616            0
## 358                0       0.9912775         0 0.9253414            0
## 359                1       0.9952790         1 0.7587636            0
## 360                1       0.9999617         1 0.9363963            1
## 361                0       0.9064579         0 0.5866941            0
## 362                1       0.7963198         1 0.5834518            0
## 363                1       0.8290659         1 0.7913857            0
## 364                1       0.8194805         1 0.7754446            1
## 365                0       0.9997643         0 0.9337191            0
## 366                0       0.9456677         0 0.7112533            0
## 367                1       0.9627341         1 0.9392595            0
## 368                0       0.9997182         0 0.9036523            0
## 369                0       0.9955477         0 0.8908229            0
## 370                1       0.9904448         1 0.9486467            0
## 371                0       0.6765416         0 0.6457900            0
## 372                1       0.7438546         1 0.9099114            0
## 373                1       0.9823005         1 0.7935532            0
## 374                1       0.9581116         1 0.6569011            0
## 375                0       0.9817047         0 0.8840865            0
## 376                1       0.8552578         1 0.8000588            1
## 377                1       0.5334343         1 0.5316924            1
## 378                1       0.9974943         1 0.8679484            0
## 379                1       0.7826520         1 0.5547653            0
## 380                0       0.7789108         0 0.7024567            0
## 381                1       0.5535358         1 0.5738954            0
## 382                0       0.9975949         0 0.9138629            0
## 383                1       0.5154175         1 0.5095049            1
## 384                0       0.5296915         1 0.5562362            1
## 385                0       0.9986759         0 0.7469801            0
## 386                0       0.9099514         0 0.7822402            0
## 387                0       0.9978333         0 0.7341256            0
## 388                1       0.9766166         1 0.9209038            1
## 389                0       0.9997648         0 0.9017469            0
## 390                1       0.9991194         1 0.9361449            0
## 391                1       0.9957568         1 0.8800996            1
## 392                0       0.9619506         0 0.8136611            0
## 393                0       0.5739552         1 0.6696623            0
## 394                0       0.9561950         0 0.6693038            0
## 395                1       0.9189469         1 0.7656323            1
## 396                1       0.7340351         1 0.6099127            1
## 397                1       0.7824256         1 0.7928029            0
## 398                0       0.6139407         1 0.5289649            1
## 399                0       0.7919936         0 0.5168807            0
## 400                1       0.9999960         1 0.9699883            1
## 401                0       0.9990518         0 0.7601992            0
## 402                0       0.9967793         0 0.8430637            0
## 403                0       0.8983827         0 0.6422225            0
## 404                1       0.8229126         1 0.6413884            1
## 405                0       0.9969608         0 0.7824713            0
## 406                1       0.9939370         1 0.9022464            0
## 407                0       0.9965315         0 0.8043431            0
## 408                0       0.9982231         0 0.8820974            0
## 409                0       0.8222102         1 0.5068385            1
## 410                1       0.8838366         1 0.6094157            0
## 411                0       0.5066103         0 0.5638507            0
## 412                1       0.7589508         0 0.6008268            0
## 413                0       0.9312361         0 0.6014157            1
## 414                0       0.9986832         0 0.9490443            0
## 415                1       0.9946605         1 0.8630077            1
## 416                0       0.8513694         1 0.5380548            1
## 417                1       0.9998579         1 0.9711560            1
## 418                0       0.9998207         0 0.5608124            0
## 419                0       0.8094248         0 0.7059211            0
## 420                0       0.8617869         0 0.6031757            0
## 421                0       0.9989258         0 0.8490449            0
## 422                0       0.5937938         0 0.5557195            1
## 423                1       0.9298344         1 0.8609913            0
## 424                1       0.9798226         1 0.8980477            1
## 425                1       0.8013949         1 0.7470908            0
## 426                0       0.9316819         0 0.5698706            0
## 427                0       0.9998221         0 0.9049087            0
## 428                0       0.9984114         0 0.8724760            0
## 429                1       0.9302210         1 0.6890074            1
## 430                0       0.8472849         1 0.5663793            0
## 431                0       0.8020082         0 0.6254264            0
## 432                0       0.7092421         0 0.6537924            0
## 433                1       0.9334738         1 0.5494953            1
## 434                1       0.9999748         1 0.9294990            0
## 435                0       0.9997997         0 0.9031776            0
## 436                0       0.9602578         0 0.6620159            0
## 437                1       0.9904479         1 0.8185883            0
## 438                1       0.7386074         0 0.5368206            0
## 439                0       0.6909321         1 0.6245582            0
## 440                1       0.8799268         1 0.5492631            0
## 441                1       0.8269256         1 0.5363651            0
## 442                1       0.9113292         1 0.6427528            0
## 443                0       0.7325425         1 0.6825822            0
## 444                0       0.5659113         1 0.6271871            1
## 445                0       0.6018692         0 0.5469666            0
## 446                1       0.9756759         1 0.8160235            1
## 447                0       0.9970487         0 0.8787692            0
## 448                1       0.8878893         1 0.6661880            0
## 449                1       0.9840432         1 0.7293530            0
## 450                1       0.7217355         0 0.6559369            0
## 451                1       0.9892149         1 0.9072151            0
## 452                0       0.9996935         0 0.9570788            0
## 453                0       0.7753398         0 0.6493737            0
## 454                0       0.7259929         1 0.5541011            0
## 455                1       0.8457212         1 0.7158160            0
## 456                0       0.5682521         1 0.5740732            1
## 457                1       0.9938881         1 0.9566009            1
## 458                0       0.6883954         1 0.8275587            0
## 459                1       0.9964929         1 0.9129558            1
## 460                1       0.8041002         1 0.6125285            1
## 461                1       0.5657612         0 0.5866070            1
## 462                0       0.9183511         0 0.7969710            0
## 463                0       0.6819846         0 0.6006783            1
## 464                1       0.6229835         1 0.5110778            0
## 465                1       0.5455476         1 0.6408842            1
## 466                0       0.9973987         0 0.8700094            0
## 467                1       0.9998983         1 0.9512236            0
## 468                1       0.9288827         1 0.6691967            0
## 469                0       0.9749746         0 0.7240163            1
## 470                0       0.9998129         0 0.8663662            0
## 471                1       0.6701653         1 0.6554700            0
## 472                1       0.9928055         1 0.9055379            1
## 473                0       0.9875350         0 0.8318631            0
## 474                1       0.8176029         1 0.5697387            0
## 475                0       0.9083791         1 0.8615574            0
## 476                0       0.9957045         1 0.5000000            0
## 477                0       0.8678525         0 0.5833664            0
## 478                0       0.9962014         0 0.7850048            0
## 479                1       0.8881209         1 0.6602523            0
## 480                0       0.9559366         0 0.6578005            0
## 481                0       0.8488480         0 0.7511494            0
## 482                0       0.9799139         0 0.6711304            0
## 483                0       0.8628035         0 0.6832332            0
## 484                1       0.8907258         1 0.6284531            0
## 485                1       0.7590302         0 0.5070122            0
## 486                0       0.8203887         0 0.6467645            0
## 487                0       0.6662094         1 0.5854988            0
## 488                0       0.9782478         0 0.7669792            0
## 489                1       0.9624025         1 0.9228885            1
## 490                0       0.5442808         0 0.6239877            0
## 491                1       0.6077986         0 0.6384102            0
## 492                0       0.9468797         0 0.7095526            0
## 493                1       0.7939146         1 0.5192914            0
## 494                0       0.9913580         0 0.8276100            0
## 495                1       0.9990438         1 0.9416137            1
## 496                1       0.8741110         1 0.9009747            1
## 497                1       0.9723695         1 0.9419218            1
## 498                0       0.8444692         0 0.7804383            0
## 499                0       0.6896941         0 0.5090347            0
## 500                0       0.9903632         0 0.7460142            0
##     GLMNET_PROB TREE_LABEL TREE_PROB MANUAL_CODE CONSENSUS_CODE
## 1     0.8738042          0 0.6708229           1              1
## 2     0.8850938          0 0.6708229           0              0
## 3     0.7534000          1 1.0000000           1              1
## 4     0.6219631          0 0.6708229           1              1
## 5     0.9182613          1 1.0000000           1              1
## 6     0.7615408          0 0.6708229           1              0
## 7     0.9035011          0 0.6708229           0              0
## 8     0.8826468          0 0.6708229           1              0
## 9     0.8278062          0 0.6708229           0              1
## 10    0.8185359          0 0.6708229           0              0
## 11    0.9037282          0 0.6708229           0              0
## 12    0.5011803          0 0.6708229           1              0
## 13    0.5607554          0 0.6708229           1              0
## 14    0.7602066          0 0.6708229           0              0
## 15    0.7320789          0 0.6708229           1              0
## 16    0.8542562          0 0.6708229           0              0
## 17    0.8607859          0 0.6708229           1              0
## 18    0.9285921          0 0.6708229           1              0
## 19    0.7003633          0 0.6708229           0              0
## 20    0.8320338          0 0.6708229           0              0
## 21    0.6003213          0 0.6708229           1              1
## 22    0.7751497          1 1.0000000           0              1
## 23    0.7778404          0 0.6708229           0              0
## 24    0.8002510          0 0.6708229           1              0
## 25    0.8826468          0 0.6708229           0              0
## 26    0.8826468          0 0.6708229           0              0
## 27    0.8626350          0 0.6708229           0              0
## 28    0.7235597          0 0.6708229           1              0
## 29    0.8826468          0 0.6708229           1              0
## 30    0.7469693          0 0.6708229           1              0
## 31    0.6485944          0 0.6708229           0              0
## 32    0.9022551          0 0.6708229           1              0
## 33    0.5380154          0 0.6708229           1              1
## 34    0.9104588          0 0.6708229           1              1
## 35    0.5563658          0 0.6708229           0              0
## 36    0.7446572          0 0.6708229           0              0
## 37    0.8826468          0 0.6708229           1              0
## 38    0.8963598          0 0.6708229           0              0
## 39    0.8826468          0 0.6708229           1              0
## 40    0.8516072          0 0.6708229           1              1
## 41    0.9631672          1 1.0000000           1              1
## 42    0.8826468          0 0.6708229           0              0
## 43    0.8241264          0 0.6708229           0              0
## 44    0.8583391          1 1.0000000           1              0
## 45    0.9081466          0 0.6708229           1              1
## 46    0.8580203          0 0.6708229           1              0
## 47    0.8406668          0 0.6708229           1              1
## 48    0.8826468          0 0.6708229           0              0
## 49    0.8826468          0 0.6708229           0              0
## 50    0.7412744          1 1.0000000           1              1
## 51    0.8826468          0 0.6708229           1              0
## 52    0.8826468          0 0.6708229           0              0
## 53    0.8826468          0 0.6708229           0              0
## 54    0.7213425          1 1.0000000           0              0
## 55    0.6378388          0 0.6708229           0              0
## 56    0.5824950          0 0.6708229           1              1
## 57    0.6691688          0 0.6708229           0              0
## 58    0.8826468          0 0.6708229           0              0
## 59    0.6446381          1 1.0000000           1              1
## 60    0.5118059          0 0.6708229           0              0
## 61    0.8826468          0 0.6708229           1              0
## 62    0.8826468          0 0.6708229           1              0
## 63    0.8624790          0 0.6708229           1              1
## 64    0.7750470          0 0.6708229           0              0
## 65    0.8826468          0 0.6708229           1              0
## 66    0.6172304          0 0.6708229           1              1
## 67    0.8826468          0 0.6708229           0              0
## 68    0.6861277          0 0.6708229           1              1
## 69    0.7446878          0 0.6708229           1              0
## 70    0.6261146          0 0.6708229           1              0
## 71    0.5000492          0 0.6708229           0              0
## 72    0.8872076          0 0.6708229           0              0
## 73    0.8501045          0 0.6708229           0              0
## 74    0.7050580          0 0.6708229           1              0
## 75    0.5013278          0 0.6708229           1              1
## 76    0.8826468          0 0.6708229           1              0
## 77    0.7503054          0 0.6708229           0              0
## 78    0.8826468          0 0.6708229           0              0
## 79    0.8873546          0 0.6708229           1              0
## 80    0.6227985          0 0.6708229           1              1
## 81    0.8826468          0 0.6708229           0              0
## 82    0.9122352          0 0.6708229           0              0
## 83    0.7828464          0 0.6708229           0              0
## 84    0.8631442          0 0.6708229           0              0
## 85    0.9839913          1 1.0000000           0              0
## 86    0.8826468          0 0.6708229           1              0
## 87    0.8826468          0 0.6708229           0              0
## 88    0.8903605          0 0.6708229           0              0
## 89    0.8581920          0 0.6708229           0              0
## 90    0.9934483          0 0.6708229           0              0
## 91    0.6881750          0 0.6708229           1              0
## 92    0.9694929          0 0.6708229           0              1
## 93    0.9027437          0 0.6708229           0              0
## 94    0.8826468          0 0.6708229           0              0
## 95    0.8919177          0 0.6708229           1              1
## 96    0.7991811          0 0.6708229           1              0
## 97    0.8117634          0 0.6708229           1              0
## 98    0.6594853          0 0.6708229           0              1
## 99    0.8826468          0 0.6708229           1              0
## 100   0.8593859          0 0.6708229           0              0
## 101   0.9444343          0 0.6708229           1              1
## 102   0.8782488          0 0.6708229           1              0
## 103   0.5051447          0 0.6708229           0              0
## 104   0.8184098          0 0.6708229           0              0
## 105   0.8826468          0 0.6708229           1              0
## 106   0.8890713          0 0.6708229           1              0
## 107   0.8706918          0 0.6708229           1              0
## 108   0.8177046          0 0.6708229           1              0
## 109   0.5632699          0 0.6708229           0              0
## 110   0.8826468          0 0.6708229           1              0
## 111   0.5650449          0 0.6708229           1              1
## 112   0.8826468          0 0.6708229           1              0
## 113   0.5989644          0 0.6708229           1              1
## 114   0.8826468          0 0.6708229           0              0
## 115   0.8654609          0 0.6708229           0              0
## 116   0.9662047          0 0.6708229           0              0
## 117   0.6084465          1 1.0000000           1              1
## 118   0.8826468          0 0.6708229           1              0
## 119   0.6433446          0 0.6708229           0              0
## 120   0.8553824          0 0.6708229           0              0
## 121   0.5087508          0 0.6708229           1              1
## 122   0.7082844          0 0.6708229           1              0
## 123   0.9784497          0 0.6708229           0              0
## 124   0.7384513          0 0.6708229           1              1
## 125   0.5306107          0 0.6708229           1              0
## 126   0.9852356          0 0.6708229           0              0
## 127   0.9741915          0 0.6708229           1              1
## 128   0.9685879          1 1.0000000           1              1
## 129   0.9522608          0 0.6708229           1              1
## 130   0.5203675          0 0.6708229           0              0
## 131   0.7380448          0 0.6708229           0              0
## 132   0.7763976          0 0.6708229           1              0
## 133   0.8826468          0 0.6708229           1              0
## 134   0.8826468          0 0.6708229           0              0
## 135   0.7249276          1 1.0000000           1              1
## 136   0.8815475          0 0.6708229           1              0
## 137   0.7908878          0 0.6708229           1              1
## 138   0.9416539          0 0.6708229           0              0
## 139   0.6783751          1 1.0000000           1              1
## 140   0.5587220          0 0.6708229           1              0
## 141   0.6190478          0 0.6708229           1              1
## 142   0.8607271          0 0.6708229           1              1
## 143   0.8361274          0 0.6708229           0              0
## 144   0.5577047          0 0.6708229           0              0
## 145   0.8162036          0 0.6708229           1              0
## 146   0.8848173          0 0.6708229           1              0
## 147   0.8826468          0 0.6708229           0              0
## 148   0.6736772          0 0.6708229           1              0
## 149   0.8870438          0 0.6708229           0              0
## 150   0.9238860          1 1.0000000           1              1
## 151   0.8157534          1 1.0000000           1              1
## 152   0.9168739          0 0.6708229           0              0
## 153   0.9232478          0 0.6708229           1              0
## 154   0.9164308          0 0.6708229           0              0
## 155   0.8976545          0 0.6708229           1              0
## 156   0.8826468          0 0.6708229           0              0
## 157   0.8321047          0 0.6708229           0              0
## 158   0.7429309          0 0.6708229           1              0
## 159   0.5729821          0 0.6708229           0              0
## 160   0.7968580          0 0.6708229           1              1
## 161   0.5446119          0 0.6708229           0              1
## 162   0.8826468          0 0.6708229           1              0
## 163   0.7240288          0 0.6708229           1              0
## 164   0.8826468          0 0.6708229           0              0
## 165   0.9022177          0 0.6708229           1              0
## 166   0.6062668          0 0.6708229           0              0
## 167   0.8826468          0 0.6708229           0              0
## 168   0.5620847          0 0.6708229           0              0
## 169   0.8314984          0 0.6708229           0              0
## 170   0.8826468          0 0.6708229           1              0
## 171   0.8380605          0 0.6708229           0              0
## 172   0.6284753          0 0.6708229           0              0
## 173   0.9377734          0 0.6708229           0              1
## 174   0.6191081          0 0.6708229           0              1
## 175   0.5595284          0 0.6708229           1              0
## 176   0.7914111          0 0.6708229           1              0
## 177   0.8826468          0 0.6708229           0              0
## 178   0.8826468          0 0.6708229           0              0
## 179   0.8655210          0 0.6708229           0              0
## 180   0.9770502          0 0.6708229           1              1
## 181   0.8826468          0 0.6708229           0              0
## 182   0.9143001          0 0.6708229           1              0
## 183   0.8645293          0 0.6708229           1              0
## 184   0.7826995          1 1.0000000           1              1
## 185   0.9110108          0 0.6708229           1              1
## 186   0.7412316          0 0.6708229           1              0
## 187   0.7622770          0 0.6708229           1              0
## 188   0.8499405          1 1.0000000           1              1
## 189   0.8826468          0 0.6708229           0              0
## 190   0.6895738          0 0.6708229           0              0
## 191   0.9561761          0 0.6708229           0              0
## 192   0.7928940          0 0.6708229           0              0
## 193   0.6289152          0 0.6708229           1              0
## 194   0.8011381          0 0.6708229           0              0
## 195   0.9197499          0 0.6708229           0              0
## 196   0.8826468          0 0.6708229           0              0
## 197   0.8434611          0 0.6708229           0              0
## 198   0.7304755          0 0.6708229           1              0
## 199   0.8826468          0 0.6708229           0              0
## 200   0.8896595          0 0.6708229           0              0
## 201   0.5126233          1 1.0000000           0              0
## 202   0.7924248          0 0.6708229           1              0
## 203   0.8107979          1 1.0000000           0              0
## 204   0.8826468          0 0.6708229           1              0
## 205   0.9730174          0 0.6708229           1              1
## 206   0.9512029          0 0.6708229           0              0
## 207   0.9249040          0 0.6708229           0              0
## 208   0.5642125          0 0.6708229           1              0
## 209   0.8878853          0 0.6708229           0              0
## 210   0.7952181          1 1.0000000           1              1
## 211   0.7237680          0 0.6708229           1              0
## 212   0.5761923          0 0.6708229           1              1
## 213   0.6367940          0 0.6708229           1              0
## 214   0.8696664          0 0.6708229           0              0
## 215   0.8533389          0 0.6708229           0              0
## 216   0.5313423          1 1.0000000           0              0
## 217   0.8826468          0 0.6708229           0              0
## 218   0.8826468          0 0.6708229           0              0
## 219   0.8826468          0 0.6708229           1              0
## 220   0.7799335          0 0.6708229           0              0
## 221   0.9796394          0 0.6708229           1              1
## 222   0.7765558          1 0.8888889           0              0
## 223   0.8696224          0 0.6708229           1              1
## 224   0.7189083          0 0.6708229           1              0
## 225   0.8826468          0 0.6708229           0              0
## 226   0.9831908          1 1.0000000           1              1
## 227   0.6702288          0 0.6708229           1              1
## 228   0.7413279          0 0.6708229           0              0
## 229   0.9404426          0 0.6708229           0              0
## 230   0.8826468          1 0.8888889           1              1
## 231   0.8611615          0 0.6708229           1              1
## 232   0.8826468          0 0.6708229           0              0
## 233   0.8690566          1 1.0000000           1              1
## 234   0.9366514          0 0.6708229           1              0
## 235   0.7453706          1 1.0000000           0              0
## 236   0.8401707          0 0.6708229           1              0
## 237   0.8826468          0 0.6708229           1              0
## 238   0.8421362          0 0.6708229           1              1
## 239   0.7001846          0 0.6708229           1              0
## 240   0.5326745          0 0.6708229           0              0
## 241   0.8826468          0 0.6708229           0              0
## 242   0.7822997          0 0.6708229           1              0
## 243   0.6245428          0 0.6708229           0              0
## 244   0.8826468          0 0.6708229           0              0
## 245   0.8224234          0 0.6708229           0              0
## 246   0.5013809          0 0.6708229           0              0
## 247   0.8270290          0 0.6708229           1              1
## 248   0.7361244          0 0.6708229           1              0
## 249   0.7558985          0 0.6708229           1              0
## 250   0.9231819          1 0.8750000           0              1
## 251   0.9973757          0 0.6708229           1              1
## 252   0.8960521          0 0.6708229           0              0
## 253   0.7785969          0 0.6708229           1              0
## 254   0.8826468          0 0.6708229           0              0
## 255   0.6776437          0 0.6708229           0              0
## 256   0.7882810          0 0.6708229           0              0
## 257   0.8303028          0 0.6708229           0              0
## 258   0.6170695          0 0.6708229           0              0
## 259   0.6823105          0 0.6708229           1              1
## 260   0.8826468          0 0.6708229           0              0
## 261   0.5571376          0 0.6708229           1              0
## 262   0.6838146          0 0.6708229           0              0
## 263   0.6871158          0 0.6708229           1              0
## 264   0.7210089          0 0.6708229           0              0
## 265   0.7421615          0 0.6708229           1              0
## 266   0.9845041          0 0.6708229           0              0
## 267   0.8826468          0 0.6708229           1              0
## 268   0.8547102          0 0.6708229           0              1
## 269   0.8854799          0 0.6708229           0              0
## 270   0.7392797          0 0.6708229           1              0
## 271   0.5360010          0 0.6708229           0              0
## 272   0.6495459          0 0.6708229           0              0
## 273   0.8239879          0 0.6708229           0              0
## 274   0.9156437          0 0.6708229           0              0
## 275   0.8578701          0 0.6708229           1              0
## 276   0.9027171          0 0.6708229           0              0
## 277   0.8826468          0 0.6708229           0              0
## 278   0.9257876          0 0.6708229           0              0
## 279   0.5071815          0 0.6708229           1              1
## 280   0.8826468          0 0.6708229           1              0
## 281   0.8826468          0 0.6708229           1              0
## 282   0.6771739          0 0.6708229           0              0
## 283   0.8964580          0 0.6708229           0              0
## 284   0.9668586          0 0.6708229           1              1
## 285   0.6584277          0 0.6708229           1              0
## 286   0.5230109          0 0.6708229           1              0
## 287   0.7385003          0 0.6708229           0              0
## 288   0.5785883          1 1.0000000           1              1
## 289   0.7742268          0 0.6708229           0              0
## 290   0.8934302          0 0.6708229           0              0
## 291   0.6021430          0 0.6708229           1              0
## 292   0.6829581          0 0.6708229           0              0
## 293   0.9285511          0 0.6708229           1              0
## 294   0.8056249          0 0.6708229           1              0
## 295   0.8962151          1 1.0000000           1              1
## 296   0.7603263          0 0.6708229           0              0
## 297   0.8826468          0 0.6708229           1              0
## 298   0.7912943          0 0.6708229           1              0
## 299   0.8286975          0 0.6708229           1              0
## 300   0.8667125          0 0.6708229           1              0
## 301   0.7365683          0 0.6708229           1              0
## 302   0.8815458          0 0.6708229           1              1
## 303   0.5807256          0 0.6708229           1              0
## 304   0.6331637          0 0.6708229           1              1
## 305   0.9364669          0 0.6708229           1              1
## 306   0.8053236          0 0.6708229           1              0
## 307   0.7299061          0 0.6708229           0              0
## 308   0.7815385          0 0.6708229           1              0
## 309   0.6943886          0 0.6708229           0              0
## 310   0.8462311          1 1.0000000           0              1
## 311   0.8971622          0 0.6708229           0              0
## 312   0.7807314          0 0.6708229           0              0
## 313   0.6971389          0 0.6708229           0              0
## 314   0.5607658          0 0.6708229           1              0
## 315   0.8826468          0 0.6708229           0              0
## 316   0.5154505          1 1.0000000           1              1
## 317   0.8889392          0 0.6708229           0              0
## 318   0.8867790          0 0.6708229           1              0
## 319   0.5414510          0 0.6708229           1              0
## 320   0.7657597          0 0.6708229           0              0
## 321   0.5742840          0 0.6708229           0              0
## 322   0.8718243          0 0.6708229           0              0
## 323   0.8563444          0 0.6708229           1              1
## 324   0.9401575          0 0.6708229           1              1
## 325   0.9531254          0 0.6708229           1              1
## 326   0.6293402          0 0.6708229           1              0
## 327   0.8642493          0 0.6708229           1              0
## 328   0.8826468          0 0.6708229           1              0
## 329   0.7047220          0 0.6708229           1              0
## 330   0.6537832          0 0.6708229           0              0
## 331   0.6376553          1 1.0000000           0              1
## 332   0.9482859          0 0.6708229           1              0
## 333   0.6789060          0 0.6708229           1              0
## 334   0.6691175          0 0.6708229           0              0
## 335   0.8826468          0 0.6708229           0              0
## 336   0.8396531          0 0.6708229           1              1
## 337   0.8567651          0 0.6708229           0              0
## 338   0.8576621          0 0.6708229           0              0
## 339   0.7193951          0 0.6708229           1              0
## 340   0.8363542          0 0.6708229           1              0
## 341   0.6966740          0 0.6708229           1              1
## 342   0.6012561          0 0.6708229           1              0
## 343   0.7845121          1 0.8888889           1              1
## 344   0.9110769          0 0.6708229           1              1
## 345   0.8826468          0 0.6708229           1              0
## 346   0.7324585          1 1.0000000           0              0
## 347   0.6219631          0 0.6708229           0              0
## 348   0.8826468          0 0.6708229           0              0
## 349   0.9240436          0 0.6708229           1              1
## 350   0.5526521          0 0.6708229           1              1
## 351   0.8826468          0 0.6708229           0              0
## 352   0.8928720          0 0.6708229           0              0
## 353   0.8676456          1 0.8888889           0              0
## 354   0.8874946          0 0.6708229           0              0
## 355   0.5199573          0 0.6708229           0              0
## 356   0.5533062          0 0.6708229           1              0
## 357   0.7564863          0 0.6708229           0              0
## 358   0.8826468          0 0.6708229           0              0
## 359   0.7817740          0 0.6708229           1              0
## 360   0.5163733          0 0.6708229           1              1
## 361   0.5847168          0 0.6708229           0              0
## 362   0.5591346          0 0.6708229           1              0
## 363   0.8826468          0 0.6708229           1              0
## 364   0.7492297          0 0.6708229           0              1
## 365   0.9206717          0 0.6708229           0              0
## 366   0.7646499          0 0.6708229           0              0
## 367   0.5954523          0 0.6708229           1              0
## 368   0.7959084          0 0.6708229           0              0
## 369   0.9040298          0 0.6708229           0              0
## 370   0.7622454          0 0.6708229           1              0
## 371   0.8623096          0 0.6708229           0              0
## 372   0.6537384          0 0.6708229           1              0
## 373   0.6389210          1 1.0000000           0              1
## 374   0.8923437          0 0.6708229           0              0
## 375   0.8826468          0 0.6708229           0              0
## 376   0.9762260          1 1.0000000           1              1
## 377   0.5727929          0 0.6708229           0              1
## 378   0.7028869          0 0.6708229           1              0
## 379   0.8826468          0 0.6708229           0              0
## 380   0.7945011          1 1.0000000           0              0
## 381   0.5503057          0 0.6708229           1              0
## 382   0.8979848          0 0.6708229           0              0
## 383   0.5259948          0 0.6708229           0              1
## 384   0.7260116          0 0.6708229           1              0
## 385   0.7264685          0 0.6708229           1              0
## 386   0.8826468          0 0.6708229           0              0
## 387   0.6699350          0 0.6708229           0              0
## 388   0.6368762          0 0.6708229           1              1
## 389   0.8826468          0 0.6708229           0              0
## 390   0.8370683          0 0.6708229           1              0
## 391   0.7182565          0 0.6708229           1              1
## 392   0.6821077          1 1.0000000           0              0
## 393   0.8342533          0 0.6708229           1              0
## 394   0.8266471          0 0.6708229           0              0
## 395   0.9194206          0 0.6708229           1              1
## 396   0.8015432          0 0.6708229           1              1
## 397   0.7939010          1 1.0000000           0              1
## 398   0.6098056          0 0.6708229           1              0
## 399   0.7262814          0 0.6708229           0              0
## 400   0.9974343          1 1.0000000           1              1
## 401   0.8826468          0 0.6708229           0              0
## 402   0.8918488          1 1.0000000           0              0
## 403   0.8826468          0 0.6708229           0              0
## 404   0.8305955          0 0.6708229           1              1
## 405   0.8826468          0 0.6708229           1              0
## 406   0.8826468          0 0.6708229           1              0
## 407   0.9438059          0 0.6708229           0              0
## 408   0.8826468          0 0.6708229           0              0
## 409   0.7088270          1 1.0000000           0              1
## 410   0.6005041          0 0.6708229           0              0
## 411   0.6433489          0 0.6708229           0              0
## 412   0.9582068          0 0.6708229           1              0
## 413   0.8460633          0 0.6708229           0              0
## 414   0.8747303          0 0.6708229           0              0
## 415   0.7526146          0 0.6708229           1              1
## 416   0.8705330          0 0.6708229           0              0
## 417   0.9908336          0 0.6708229           1              1
## 418   0.8826468          0 0.6708229           0              0
## 419   0.8826468          0 0.6708229           1              0
## 420   0.8951898          0 0.6708229           1              0
## 421   0.8826468          0 0.6708229           1              0
## 422   0.6050418          1 1.0000000           0              0
## 423   0.6377939          0 0.6708229           1              0
## 424   0.9812487          0 0.6708229           1              1
## 425   0.7938011          0 0.6708229           1              0
## 426   0.8826468          0 0.6708229           0              0
## 427   0.9727102          0 0.6708229           0              0
## 428   0.8826468          0 0.6708229           0              0
## 429   0.5335150          0 0.6708229           1              1
## 430   0.6786484          0 0.6708229           1              0
## 431   0.9251969          0 0.6708229           0              0
## 432   0.8826468          0 0.6708229           0              0
## 433   0.6103072          0 0.6708229           1              1
## 434   0.7400782          0 0.6708229           1              0
## 435   0.8826468          0 0.6708229           0              0
## 436   0.9178899          0 0.6708229           0              0
## 437   0.8807102          0 0.6708229           1              0
## 438   0.6527935          1 0.8888889           0              0
## 439   0.8826468          0 0.6708229           0              0
## 440   0.7764394          0 0.6708229           0              0
## 441   0.8118343          0 0.6708229           1              0
## 442   0.9691737          0 0.6708229           0              0
## 443   0.8826468          0 0.6708229           1              0
## 444   0.9516122          0 0.6708229           0              0
## 445   0.8625652          0 0.6708229           0              0
## 446   0.8900183          0 0.6708229           1              1
## 447   0.8971130          0 0.6708229           0              0
## 448   0.7727855          0 0.6708229           1              0
## 449   0.6110396          0 0.6708229           0              0
## 450   0.8826468          0 0.6708229           0              0
## 451   0.6718131          0 0.6708229           0              0
## 452   0.8957181          0 0.6708229           0              0
## 453   0.9514845          0 0.6708229           0              0
## 454   0.6750882          0 0.6708229           0              0
## 455   0.7029653          1 1.0000000           1              1
## 456   0.5696332          0 0.6708229           1              0
## 457   0.9631948          1 1.0000000           1              1
## 458   0.8826468          0 0.6708229           1              0
## 459   0.6205124          1 1.0000000           1              1
## 460   0.7671619          0 0.6708229           1              1
## 461   0.5304241          0 0.6708229           0              0
## 462   0.8926163          0 0.6708229           0              0
## 463   0.9574430          0 0.6708229           1              0
## 464   0.7346101          1 1.0000000           1              1
## 465   0.8545097          0 0.6708229           1              1
## 466   0.6748597          0 0.6708229           0              0
## 467   0.6710538          0 0.6708229           1              0
## 468   0.8110653          0 0.6708229           1              0
## 469   0.9614801          0 0.6708229           0              0
## 470   0.8826468          0 0.6708229           0              0
## 471   0.6268481          0 0.6708229           1              0
## 472   0.8324457          1 0.8888889           1              1
## 473   0.5992754          0 0.6708229           0              0
## 474   0.9224705          0 0.6708229           0              0
## 475   0.8826468          0 0.6708229           0              0
## 476   0.6173178          0 0.6708229           0              0
## 477   0.8441209          0 0.6708229           0              0
## 478   0.5131976          0 0.6708229           0              0
## 479   0.5436847          0 0.6708229           1              0
## 480   0.7228243          0 0.6708229           0              0
## 481   0.7830620          0 0.6708229           0              0
## 482   0.5417731          0 0.6708229           0              0
## 483   0.8826468          0 0.6708229           1              0
## 484   0.5797255          0 0.6708229           1              0
## 485   0.7649127          1 1.0000000           0              0
## 486   0.8480237          0 0.6708229           0              0
## 487   0.8320971          0 0.6708229           0              0
## 488   0.8826468          0 0.6708229           0              0
## 489   0.7535356          0 0.6708229           1              1
## 490   0.7639696          0 0.6708229           1              0
## 491   0.5532418          0 0.6708229           0              0
## 492   0.8826468          0 0.6708229           1              0
## 493   0.8737815          0 0.6708229           1              0
## 494   0.8219490          0 0.6708229           0              0
## 495   0.6303809          1 1.0000000           1              1
## 496   0.5441030          0 0.6708229           1              1
## 497   0.9138387          0 0.6708229           1              1
## 498   0.8427927          0 0.6708229           1              0
## 499   0.8046984          1 1.0000000           0              0
## 500   0.8007609          0 0.6708229           0              0
##     CONSENSUS_AGREE CONSENSUS_INCORRECT PROBABILITY_CODE
## 1                 3                   0                1
## 2                 4                   0                0
## 3                 3                   0                1
## 4                 3                   0                1
## 5                 4                   0                1
## 6                 2                   1                1
## 7                 4                   0                0
## 8                 2                   1                1
## 9                 3                   1                1
## 10                2                   0                1
## 11                2                   0                1
## 12                2                   1                1
## 13                2                   1                1
## 14                3                   0                0
## 15                4                   1                0
## 16                4                   0                0
## 17                2                   1                1
## 18                2                   1                0
## 19                2                   0                1
## 20                4                   0                0
## 21                3                   0                1
## 22                3                   1                1
## 23                3                   0                0
## 24                2                   1                1
## 25                4                   0                0
## 26                4                   0                0
## 27                2                   0                1
## 28                2                   1                1
## 29                2                   1                1
## 30                4                   1                0
## 31                4                   0                0
## 32                4                   1                0
## 33                3                   0                1
## 34                3                   0                1
## 35                3                   0                0
## 36                4                   0                0
## 37                4                   1                0
## 38                4                   0                0
## 39                4                   1                0
## 40                3                   0                1
## 41                4                   0                1
## 42                4                   0                0
## 43                2                   0                0
## 44                2                   1                1
## 45                3                   0                1
## 46                3                   1                0
## 47                3                   0                1
## 48                4                   0                0
## 49                4                   0                0
## 50                4                   0                1
## 51                2                   1                1
## 52                4                   0                0
## 53                4                   0                0
## 54                3                   0                1
## 55                4                   0                0
## 56                3                   0                1
## 57                2                   0                0
## 58                4                   0                0
## 59                4                   0                1
## 60                3                   0                0
## 61                4                   1                0
## 62                2                   1                0
## 63                3                   0                1
## 64                3                   0                0
## 65                3                   1                0
## 66                3                   0                0
## 67                4                   0                0
## 68                3                   0                1
## 69                2                   1                1
## 70                2                   1                1
## 71                2                   0                1
## 72                4                   0                0
## 73                4                   0                0
## 74                2                   1                1
## 75                3                   0                1
## 76                3                   1                0
## 77                4                   0                0
## 78                3                   0                0
## 79                3                   1                0
## 80                3                   0                1
## 81                4                   0                0
## 82                4                   0                0
## 83                4                   0                0
## 84                4                   0                0
## 85                2                   0                1
## 86                4                   1                0
## 87                3                   0                0
## 88                4                   0                0
## 89                4                   0                0
## 90                3                   0                0
## 91                2                   1                1
## 92                3                   1                1
## 93                4                   0                0
## 94                4                   0                0
## 95                3                   0                1
## 96                2                   1                1
## 97                2                   1                0
## 98                3                   1                1
## 99                2                   1                1
## 100               2                   0                1
## 101               3                   0                1
## 102               2                   1                0
## 103               3                   0                0
## 104               4                   0                0
## 105               2                   1                0
## 106               4                   1                0
## 107               2                   1                1
## 108               3                   1                0
## 109               3                   0                0
## 110               4                   1                0
## 111               3                   0                1
## 112               3                   1                1
## 113               3                   0                1
## 114               4                   0                0
## 115               2                   0                0
## 116               3                   0                0
## 117               3                   0                1
## 118               2                   1                1
## 119               2                   0                0
## 120               4                   0                0
## 121               3                   0                1
## 122               2                   1                1
## 123               2                   0                1
## 124               3                   0                1
## 125               2                   1                1
## 126               4                   0                0
## 127               3                   0                1
## 128               4                   0                1
## 129               3                   0                1
## 130               4                   0                0
## 131               4                   0                0
## 132               2                   1                1
## 133               2                   1                1
## 134               4                   0                0
## 135               4                   0                1
## 136               2                   1                1
## 137               3                   0                1
## 138               4                   0                0
## 139               4                   0                1
## 140               2                   1                1
## 141               3                   0                1
## 142               3                   0                1
## 143               4                   0                0
## 144               3                   0                0
## 145               4                   1                0
## 146               4                   1                0
## 147               2                   0                1
## 148               2                   1                1
## 149               4                   0                0
## 150               4                   0                1
## 151               4                   0                1
## 152               4                   0                0
## 153               4                   1                0
## 154               4                   0                0
## 155               4                   1                0
## 156               2                   0                1
## 157               4                   0                0
## 158               2                   1                1
## 159               4                   0                0
## 160               3                   0                1
## 161               3                   1                1
## 162               4                   1                0
## 163               2                   1                1
## 164               4                   0                0
## 165               4                   1                0
## 166               2                   0                0
## 167               4                   0                0
## 168               2                   0                1
## 169               4                   0                0
## 170               3                   1                0
## 171               4                   0                0
## 172               3                   0                0
## 173               3                   1                1
## 174               3                   1                1
## 175               2                   1                1
## 176               2                   1                0
## 177               2                   0                0
## 178               4                   0                0
## 179               4                   0                0
## 180               3                   0                1
## 181               4                   0                0
## 182               2                   1                0
## 183               2                   1                1
## 184               4                   0                1
## 185               3                   0                1
## 186               2                   1                1
## 187               3                   1                0
## 188               3                   0                1
## 189               2                   0                0
## 190               4                   0                0
## 191               3                   0                0
## 192               3                   0                0
## 193               2                   1                1
## 194               4                   0                0
## 195               3                   0                0
## 196               4                   0                0
## 197               4                   0                0
## 198               3                   1                0
## 199               2                   0                1
## 200               4                   0                0
## 201               2                   0                1
## 202               2                   1                1
## 203               2                   0                1
## 204               4                   1                0
## 205               3                   0                1
## 206               4                   0                0
## 207               3                   0                0
## 208               2                   1                1
## 209               2                   0                0
## 210               3                   0                1
## 211               2                   1                0
## 212               3                   0                1
## 213               2                   1                0
## 214               4                   0                0
## 215               4                   0                0
## 216               2                   0                1
## 217               4                   0                0
## 218               4                   0                0
## 219               3                   1                0
## 220               4                   0                0
## 221               3                   0                1
## 222               2                   0                1
## 223               3                   0                1
## 224               3                   1                0
## 225               4                   0                0
## 226               4                   0                1
## 227               3                   0                1
## 228               4                   0                0
## 229               4                   0                0
## 230               3                   0                1
## 231               3                   0                1
## 232               4                   0                0
## 233               4                   0                1
## 234               4                   1                0
## 235               2                   0                1
## 236               4                   1                0
## 237               2                   1                1
## 238               3                   0                1
## 239               2                   1                1
## 240               4                   0                0
## 241               4                   0                0
## 242               4                   1                0
## 243               4                   0                0
## 244               4                   0                0
## 245               3                   0                0
## 246               2                   0                1
## 247               3                   0                1
## 248               2                   1                1
## 249               2                   1                1
## 250               4                   1                1
## 251               3                   0                1
## 252               4                   0                0
## 253               2                   1                1
## 254               4                   0                0
## 255               3                   0                0
## 256               3                   0                1
## 257               4                   0                0
## 258               2                   0                1
## 259               3                   0                1
## 260               2                   0                1
## 261               2                   1                1
## 262               4                   0                0
## 263               2                   1                1
## 264               4                   0                0
## 265               2                   1                1
## 266               4                   0                0
## 267               2                   1                0
## 268               3                   1                1
## 269               4                   0                0
## 270               2                   1                1
## 271               3                   0                0
## 272               4                   0                0
## 273               4                   0                0
## 274               4                   0                0
## 275               3                   1                0
## 276               4                   0                0
## 277               4                   0                0
## 278               4                   0                0
## 279               3                   0                1
## 280               4                   1                0
## 281               3                   1                0
## 282               4                   0                0
## 283               4                   0                0
## 284               3                   0                1
## 285               4                   1                0
## 286               2                   1                0
## 287               4                   0                0
## 288               4                   0                1
## 289               3                   0                0
## 290               4                   0                0
## 291               2                   1                1
## 292               3                   0                0
## 293               2                   1                0
## 294               3                   1                0
## 295               4                   0                1
## 296               3                   0                0
## 297               3                   1                0
## 298               2                   1                1
## 299               3                   1                0
## 300               2                   1                1
## 301               2                   1                1
## 302               3                   0                1
## 303               2                   1                1
## 304               3                   0                1
## 305               3                   0                1
## 306               2                   1                1
## 307               4                   0                0
## 308               3                   1                0
## 309               3                   0                1
## 310               3                   1                1
## 311               4                   0                0
## 312               3                   0                0
## 313               4                   0                0
## 314               2                   1                0
## 315               2                   0                1
## 316               3                   0                1
## 317               2                   0                0
## 318               2                   1                1
## 319               2                   1                1
## 320               2                   0                0
## 321               3                   0                0
## 322               4                   0                0
## 323               3                   0                1
## 324               3                   0                1
## 325               3                   0                1
## 326               2                   1                0
## 327               2                   1                1
## 328               2                   1                1
## 329               4                   1                0
## 330               3                   0                0
## 331               3                   1                1
## 332               4                   1                0
## 333               2                   1                1
## 334               4                   0                0
## 335               2                   0                0
## 336               3                   0                1
## 337               4                   0                0
## 338               4                   0                0
## 339               2                   1                1
## 340               3                   1                0
## 341               3                   0                1
## 342               4                   1                0
## 343               3                   0                1
## 344               3                   0                1
## 345               4                   1                0
## 346               2                   0                1
## 347               3                   0                0
## 348               4                   0                0
## 349               3                   0                1
## 350               3                   0                1
## 351               4                   0                0
## 352               4                   0                0
## 353               3                   0                1
## 354               4                   0                0
## 355               2                   0                0
## 356               3                   1                0
## 357               4                   0                0
## 358               4                   0                0
## 359               2                   1                1
## 360               3                   0                1
## 361               4                   0                0
## 362               2                   1                1
## 363               2                   1                0
## 364               3                   1                1
## 365               4                   0                0
## 366               4                   0                0
## 367               2                   1                1
## 368               4                   0                0
## 369               4                   0                0
## 370               2                   1                1
## 371               4                   0                0
## 372               2                   1                1
## 373               3                   1                1
## 374               2                   0                1
## 375               4                   0                0
## 376               4                   0                1
## 377               3                   1                0
## 378               2                   1                1
## 379               2                   0                0
## 380               3                   0                1
## 381               2                   1                0
## 382               4                   0                0
## 383               3                   1                0
## 384               2                   1                1
## 385               4                   1                0
## 386               4                   0                0
## 387               4                   0                0
## 388               3                   0                1
## 389               4                   0                0
## 390               2                   1                1
## 391               3                   0                1
## 392               3                   0                1
## 393               3                   1                0
## 394               4                   0                0
## 395               3                   0                1
## 396               3                   0                1
## 397               3                   1                1
## 398               2                   1                0
## 399               4                   0                0
## 400               4                   0                1
## 401               4                   0                0
## 402               3                   0                1
## 403               4                   0                0
## 404               3                   0                1
## 405               4                   1                0
## 406               2                   1                1
## 407               4                   0                0
## 408               4                   0                0
## 409               3                   1                1
## 410               2                   0                1
## 411               4                   0                0
## 412               3                   1                0
## 413               3                   0                0
## 414               4                   0                0
## 415               3                   0                1
## 416               2                   0                1
## 417               3                   0                1
## 418               4                   0                0
## 419               4                   1                0
## 420               4                   1                0
## 421               4                   1                0
## 422               2                   0                1
## 423               2                   1                1
## 424               3                   0                1
## 425               2                   1                1
## 426               4                   0                0
## 427               4                   0                0
## 428               4                   0                0
## 429               3                   0                1
## 430               3                   1                0
## 431               4                   0                0
## 432               4                   0                0
## 433               3                   0                1
## 434               2                   1                1
## 435               4                   0                0
## 436               4                   0                0
## 437               2                   1                1
## 438               2                   0                1
## 439               3                   0                0
## 440               2                   0                1
## 441               2                   1                1
## 442               2                   0                0
## 443               3                   1                0
## 444               2                   0                1
## 445               4                   0                0
## 446               3                   0                1
## 447               4                   0                0
## 448               2                   1                1
## 449               2                   0                1
## 450               3                   0                0
## 451               2                   0                1
## 452               4                   0                0
## 453               4                   0                0
## 454               3                   0                0
## 455               3                   0                1
## 456               2                   1                0
## 457               4                   0                1
## 458               3                   1                0
## 459               4                   0                1
## 460               3                   0                1
## 461               2                   0                0
## 462               4                   0                0
## 463               3                   1                1
## 464               3                   0                1
## 465               3                   0                1
## 466               4                   0                0
## 467               2                   1                1
## 468               2                   1                1
## 469               3                   0                0
## 470               4                   0                0
## 471               2                   1                0
## 472               4                   0                1
## 473               4                   0                0
## 474               2                   0                0
## 475               3                   0                0
## 476               3                   0                0
## 477               4                   0                0
## 478               4                   0                0
## 479               2                   1                1
## 480               4                   0                0
## 481               4                   0                0
## 482               4                   0                0
## 483               4                   1                0
## 484               2                   1                1
## 485               2                   0                1
## 486               4                   0                0
## 487               3                   0                0
## 488               4                   0                0
## 489               3                   0                1
## 490               4                   1                0
## 491               3                   0                0
## 492               4                   1                0
## 493               2                   1                0
## 494               4                   0                0
## 495               4                   0                1
## 496               3                   0                1
## 497               3                   0                1
## 498               4                   1                0
## 499               3                   0                1
## 500               4                   0                0
##     PROBABILITY_INCORRECT
## 1                       0
## 2                       0
## 3                       0
## 4                       0
## 5                       0
## 6                       0
## 7                       0
## 8                       0
## 9                       1
## 10                      1
## 11                      1
## 12                      0
## 13                      0
## 14                      0
## 15                      1
## 16                      0
## 17                      0
## 18                      1
## 19                      1
## 20                      0
## 21                      0
## 22                      1
## 23                      0
## 24                      0
## 25                      0
## 26                      0
## 27                      1
## 28                      0
## 29                      0
## 30                      1
## 31                      0
## 32                      1
## 33                      0
## 34                      0
## 35                      0
## 36                      0
## 37                      1
## 38                      0
## 39                      1
## 40                      0
## 41                      0
## 42                      0
## 43                      0
## 44                      0
## 45                      0
## 46                      1
## 47                      0
## 48                      0
## 49                      0
## 50                      0
## 51                      0
## 52                      0
## 53                      0
## 54                      1
## 55                      0
## 56                      0
## 57                      0
## 58                      0
## 59                      0
## 60                      0
## 61                      1
## 62                      1
## 63                      0
## 64                      0
## 65                      1
## 66                      1
## 67                      0
## 68                      0
## 69                      0
## 70                      0
## 71                      1
## 72                      0
## 73                      0
## 74                      0
## 75                      0
## 76                      1
## 77                      0
## 78                      0
## 79                      1
## 80                      0
## 81                      0
## 82                      0
## 83                      0
## 84                      0
## 85                      1
## 86                      1
## 87                      0
## 88                      0
## 89                      0
## 90                      0
## 91                      0
## 92                      1
## 93                      0
## 94                      0
## 95                      0
## 96                      0
## 97                      1
## 98                      1
## 99                      0
## 100                     1
## 101                     0
## 102                     1
## 103                     0
## 104                     0
## 105                     1
## 106                     1
## 107                     0
## 108                     1
## 109                     0
## 110                     1
## 111                     0
## 112                     0
## 113                     0
## 114                     0
## 115                     0
## 116                     0
## 117                     0
## 118                     0
## 119                     0
## 120                     0
## 121                     0
## 122                     0
## 123                     1
## 124                     0
## 125                     0
## 126                     0
## 127                     0
## 128                     0
## 129                     0
## 130                     0
## 131                     0
## 132                     0
## 133                     0
## 134                     0
## 135                     0
## 136                     0
## 137                     0
## 138                     0
## 139                     0
## 140                     0
## 141                     0
## 142                     0
## 143                     0
## 144                     0
## 145                     1
## 146                     1
## 147                     1
## 148                     0
## 149                     0
## 150                     0
## 151                     0
## 152                     0
## 153                     1
## 154                     0
## 155                     1
## 156                     1
## 157                     0
## 158                     0
## 159                     0
## 160                     0
## 161                     1
## 162                     1
## 163                     0
## 164                     0
## 165                     1
## 166                     0
## 167                     0
## 168                     1
## 169                     0
## 170                     1
## 171                     0
## 172                     0
## 173                     1
## 174                     1
## 175                     0
## 176                     1
## 177                     0
## 178                     0
## 179                     0
## 180                     0
## 181                     0
## 182                     1
## 183                     0
## 184                     0
## 185                     0
## 186                     0
## 187                     1
## 188                     0
## 189                     0
## 190                     0
## 191                     0
## 192                     0
## 193                     0
## 194                     0
## 195                     0
## 196                     0
## 197                     0
## 198                     1
## 199                     1
## 200                     0
## 201                     1
## 202                     0
## 203                     1
## 204                     1
## 205                     0
## 206                     0
## 207                     0
## 208                     0
## 209                     0
## 210                     0
## 211                     1
## 212                     0
## 213                     1
## 214                     0
## 215                     0
## 216                     1
## 217                     0
## 218                     0
## 219                     1
## 220                     0
## 221                     0
## 222                     1
## 223                     0
## 224                     1
## 225                     0
## 226                     0
## 227                     0
## 228                     0
## 229                     0
## 230                     0
## 231                     0
## 232                     0
## 233                     0
## 234                     1
## 235                     1
## 236                     1
## 237                     0
## 238                     0
## 239                     0
## 240                     0
## 241                     0
## 242                     1
## 243                     0
## 244                     0
## 245                     0
## 246                     1
## 247                     0
## 248                     0
## 249                     0
## 250                     1
## 251                     0
## 252                     0
## 253                     0
## 254                     0
## 255                     0
## 256                     1
## 257                     0
## 258                     1
## 259                     0
## 260                     1
## 261                     0
## 262                     0
## 263                     0
## 264                     0
## 265                     0
## 266                     0
## 267                     1
## 268                     1
## 269                     0
## 270                     0
## 271                     0
## 272                     0
## 273                     0
## 274                     0
## 275                     1
## 276                     0
## 277                     0
## 278                     0
## 279                     0
## 280                     1
## 281                     1
## 282                     0
## 283                     0
## 284                     0
## 285                     1
## 286                     1
## 287                     0
## 288                     0
## 289                     0
## 290                     0
## 291                     0
## 292                     0
## 293                     1
## 294                     1
## 295                     0
## 296                     0
## 297                     1
## 298                     0
## 299                     1
## 300                     0
## 301                     0
## 302                     0
## 303                     0
## 304                     0
## 305                     0
## 306                     0
## 307                     0
## 308                     1
## 309                     1
## 310                     1
## 311                     0
## 312                     0
## 313                     0
## 314                     1
## 315                     1
## 316                     0
## 317                     0
## 318                     0
## 319                     0
## 320                     0
## 321                     0
## 322                     0
## 323                     0
## 324                     0
## 325                     0
## 326                     1
## 327                     0
## 328                     0
## 329                     1
## 330                     0
## 331                     1
## 332                     1
## 333                     0
## 334                     0
## 335                     0
## 336                     0
## 337                     0
## 338                     0
## 339                     0
## 340                     1
## 341                     0
## 342                     1
## 343                     0
## 344                     0
## 345                     1
## 346                     1
## 347                     0
## 348                     0
## 349                     0
## 350                     0
## 351                     0
## 352                     0
## 353                     1
## 354                     0
## 355                     0
## 356                     1
## 357                     0
## 358                     0
## 359                     0
## 360                     0
## 361                     0
## 362                     0
## 363                     1
## 364                     1
## 365                     0
## 366                     0
## 367                     0
## 368                     0
## 369                     0
## 370                     0
## 371                     0
## 372                     0
## 373                     1
## 374                     1
## 375                     0
## 376                     0
## 377                     0
## 378                     0
## 379                     0
## 380                     1
## 381                     1
## 382                     0
## 383                     0
## 384                     0
## 385                     1
## 386                     0
## 387                     0
## 388                     0
## 389                     0
## 390                     0
## 391                     0
## 392                     1
## 393                     1
## 394                     0
## 395                     0
## 396                     0
## 397                     1
## 398                     1
## 399                     0
## 400                     0
## 401                     0
## 402                     1
## 403                     0
## 404                     0
## 405                     1
## 406                     0
## 407                     0
## 408                     0
## 409                     1
## 410                     1
## 411                     0
## 412                     1
## 413                     0
## 414                     0
## 415                     0
## 416                     1
## 417                     0
## 418                     0
## 419                     1
## 420                     1
## 421                     1
## 422                     1
## 423                     0
## 424                     0
## 425                     0
## 426                     0
## 427                     0
## 428                     0
## 429                     0
## 430                     1
## 431                     0
## 432                     0
## 433                     0
## 434                     0
## 435                     0
## 436                     0
## 437                     0
## 438                     1
## 439                     0
## 440                     1
## 441                     0
## 442                     0
## 443                     1
## 444                     1
## 445                     0
## 446                     0
## 447                     0
## 448                     0
## 449                     1
## 450                     0
## 451                     1
## 452                     0
## 453                     0
## 454                     0
## 455                     0
## 456                     1
## 457                     0
## 458                     1
## 459                     0
## 460                     0
## 461                     0
## 462                     0
## 463                     0
## 464                     0
## 465                     0
## 466                     0
## 467                     0
## 468                     0
## 469                     0
## 470                     0
## 471                     1
## 472                     0
## 473                     0
## 474                     0
## 475                     0
## 476                     0
## 477                     0
## 478                     0
## 479                     0
## 480                     0
## 481                     0
## 482                     0
## 483                     1
## 484                     0
## 485                     1
## 486                     0
## 487                     0
## 488                     0
## 489                     0
## 490                     1
## 491                     0
## 492                     1
## 493                     1
## 494                     0
## 495                     0
## 496                     0
## 497                     0
## 498                     1
## 499                     1
## 500                     0
analytics@ensemble_summary # SUMMARY OF ENSEMBLE PRECISION/COVERAGE. USES THE n VARIABLE PASSED INTO create_analytics()
##        n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
## n >= 1                1.00              0.66
## n >= 2                1.00              0.66
## n >= 3                0.72              0.79
## n >= 4                0.39              0.82
#CONFUSION MATRIX
yhat = as.matrix(analytics@document_summary$CONSENSUS_CODE)
y = flag[(n/2+1):n]
print(table(y,yhat))
##    yhat
## y     0   1
##   0 234  17
##   1 152  97
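
The counts in the confusion matrix can be turned into the usual summary rates. The following is a minimal sketch that re-enters the four counts printed above by hand (the object name `cm` is illustrative and not part of the original code), then computes accuracy, and precision and recall for class 1.

```r
# Confusion matrix copied from the printed output above
# (rows = actual y, columns = predicted yhat)
cm = matrix(c(234, 17,
              152, 97), nrow=2, byrow=TRUE,
            dimnames=list(y=c("0","1"), yhat=c("0","1")))

accuracy  = sum(diag(cm)) / sum(cm)       # share of all documents classified correctly
precision = cm["1","1"] / sum(cm[,"1"])   # correct among documents predicted as 1
recall    = cm["1","1"] / sum(cm["1",])   # correct among documents that are actually 1
print(round(c(accuracy=accuracy, precision=precision, recall=recall), 3))
```

Accuracy works out to 331/500, or about 0.66. Class 1 is predicted rarely but fairly precisely (97 of 114 predictions are correct), while most actual 1s are missed (97 of 249 are recovered).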

Grading Text

In recent years, the SAT exams added a new essay section. While the test aimed at assessing original writing, it also introduced automated grading. A goal of the test is to assess the writing level of the student. This is associated with the notion of readability.

Readability

“Readability” is a metric of how easy it is to comprehend text. Given a goal of efficient markets, regulators want to foster transparency by making sure financial documents that are disseminated to the investing public are readable. Hence, metrics for readability are very important and are recently gaining traction.

Gunning-Fog Index

Gunning (1952) developed the Fog index. The index estimates the years of formal education needed to understand text on a first reading. A fog index of 12 requires the reading level of a U.S. high school senior (around 18 years old). The index is based on the idea that poor readability is associated with longer sentences and complex words. Complex words are those that have more than two syllables. The formula for the Fog index is

\[ 0.4 \cdot \left[\frac{\mbox{\#words}}{\mbox{\#sentences}} + 100 \cdot \left( \frac{\mbox{\#complex words}}{\mbox{\#words}} \right) \right] \]
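
As a quick illustration of the formula, here is a small sketch that computes the Fog index from raw counts. The function `fog_index` and the counts passed to it are made up for this example; they do not come from koRpus or from any document analyzed here.

```r
# Gunning-Fog index from raw counts (hypothetical helper for illustration)
fog_index = function(n_words, n_sentences, n_complex) {
  0.4 * (n_words/n_sentences + 100 * n_complex/n_words)
}

# Made-up counts: 100 words in 5 sentences, 10 of them complex
# 0.4 * (20 words per sentence + 10 percent complex words)
print(fog_index(100, 5, 10))
```

With these counts the index is 12, the level associated above with a U.S. high school senior.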

Alternative readability scores use similar ideas. The Flesch Reading Ease Score and the Flesch-Kincaid Grade Level also use counts of words, syllables, and sentences. See http://en.wikipedia.org/wiki/Flesch-Kincaid_readability_tests. The Flesch Reading Ease Score is defined as

\[ 206.835 - 1.015 \left(\frac{\mbox{\#words}}{\mbox{\#sentences}}\right) - 84.6 \left( \frac{\mbox{\#syllables}}{\mbox{\#words}} \right) \]

Scores of 90-100 indicate text that is easily understood by an 11-year-old, scores of 60-70 indicate text that is easy to understand for 13-15 year olds, and scores of 0-30 indicate text best understood by university graduates.
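
The Reading Ease score can be sketched the same way from raw counts. The helper name `flesch_reading_ease` and the example counts are assumptions made for illustration only.

```r
# Flesch Reading Ease from raw counts (hypothetical helper for illustration)
flesch_reading_ease = function(n_words, n_sentences, n_syllables) {
  206.835 - 1.015 * (n_words/n_sentences) - 84.6 * (n_syllables/n_words)
}

# Made-up counts: 100 words, 5 sentences, 150 syllables
print(flesch_reading_ease(100, 5, 150))
```

These counts give a score of roughly 59.6. Longer sentences or more syllables per word both push the score down, that is, toward harder text.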

The Flesch-Kincaid Grade Level

This is defined as

\[ 0.39 \left(\frac{\mbox{\#words}}{\mbox{\#sentences}}\right) + 11.8 \left( \frac{\mbox{\#syllables}}{\mbox{\#words}} \right) -15.59 \]

which gives a number that corresponds to the U.S. school grade level. As expected, the Reading Ease score and the Grade Level are negatively correlated, since harder text lowers the first and raises the second. Various other measures of readability use the same ideas as in the Fog index. For example, the Coleman and Liau (1975) index does not even require a count of syllables:

\[ CLI = 0.0588 L - 0.296 S - 15.8 \]

where \(L\) is the average number of letters per hundred words and \(S\) is the average number of sentences per hundred words.
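
The two grade-level formulas above can be sketched in base R. The Flesch-Kincaid helper takes counts as inputs, while the Coleman-Liau helper works directly on a string because it needs only letters, words, and sentences. Both function names are hypothetical, and the tokenization (splitting on non-letters, treating runs of `.`, `!`, `?` as sentence ends) is a crude assumption rather than what koRpus does.

```r
# Flesch-Kincaid Grade Level from raw counts (hypothetical helper)
fk_grade = function(n_words, n_sentences, n_syllables) {
  0.39 * (n_words/n_sentences) + 11.8 * (n_syllables/n_words) - 15.59
}

# Coleman-Liau index straight from text (hypothetical helper, crude tokenizer)
coleman_liau = function(txt) {
  words = unlist(strsplit(txt, "[^A-Za-z']+"))    # split on anything that is not a letter
  words = words[nchar(words) > 0]                 # drop empty strings
  n_words = length(words)
  n_letters = sum(nchar(gsub("[^A-Za-z]", "", words)))
  ends = gregexpr("[.!?]+", txt)[[1]]             # positions of sentence terminators
  n_sentences = sum(ends > 0)                     # gregexpr returns -1 when none found
  L = 100 * n_letters / n_words                   # letters per hundred words
  S = 100 * n_sentences / n_words                 # sentences per hundred words
  0.0588 * L - 0.296 * S - 15.8
}

print(fk_grade(100, 5, 150))
print(coleman_liau("The cat sat on the mat. It was happy."))
```

Very simple text, such as the toy sentence here, can produce a negative Coleman-Liau value; the index is calibrated for ordinary prose rather than nine-word samples.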

Standard readability metrics may not work well for financial text. Loughran and McDonald (2014) find that the Fog index is inferior to simply looking at 10-K file size.

References

M. Coleman and T. L. Liau (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology 60, 283-284.

T. Loughran and W. McDonald (2014). Measuring readability in financial disclosures. The Journal of Finance 69, 1643-1671.

The koRpus package

The R package koRpus provides readability scoring; see http://www.inside-r.org/packages/cran/koRpus/docs/readability

First, let’s grab some text from my web site.

library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:qdap':
## 
##     %>%
## The following object is masked from 'package:XML':
## 
##     xml
url = "http://srdas.github.io/bio-candid.html"

doc.html = read_html(url)
text = doc.html %>% html_nodes("p") %>% html_text()

text = gsub("[\t\n]"," ",text)
text = gsub('"'," ",text)   #replaces double quotes with spaces
text = paste(text, collapse=" ")
print(text)
## [1] " Sanjiv Das: A Short Academic Life History    After loafing and working in many parts of Asia, but never really growing up, Sanjiv moved to New York to change the world, hopefully through research.  He graduated in 1994 with a Ph.D. from NYU, and since then spent five years in Boston, and now lives in San Jose, California.  Sanjiv loves animals, places in the world where the mountains meet the sea, riding sport motorbikes, reading, gadgets, science fiction movies, and writing cool software code. When there is time available from the excitement of daily life, Sanjiv writes academic papers, which helps him relax. Always the contrarian, Sanjiv thinks that New York City is the most calming place in the world, after California of course.     Sanjiv is now a Professor of Finance at Santa Clara University. He came to SCU from Harvard Business School and spent a year at UC Berkeley. In his past life in the unreal world, Sanjiv worked at Citibank, N.A. in the Asia-Pacific region. He takes great pleasure in merging his many previous lives into his current existence, which is incredibly confused and diverse.     Sanjiv's research style is instilled with a distinct  New York state of mind  - it is chaotic, diverse, with minimal method to the madness. He has published articles on derivatives, term-structure models, mutual funds, the internet, portfolio choice, banking models, credit risk, and has unpublished articles in many other areas. Some years ago, he took time off to get another degree in computer science at Berkeley, confirming that an unchecked hobby can quickly become an obsession. There he learnt about the fascinating field of Randomized Algorithms, skills he now applies earnestly to his editorial work, and other pursuits, many of which stem from being in the epicenter of Silicon Valley.     Coastal living did a lot to mold Sanjiv, who needs to live near the ocean.  
The many walks in Greenwich village convinced him that there is no such thing as a representative investor, yet added many unique features to his personal utility function. He learnt that it is important to open the academic door to the ivory tower and let the world in. Academia is a real challenge, given that he has to reconcile many more opinions than ideas. He has been known to have turned down many offers from Mad magazine to publish his academic work. As he often explains, you never really finish your education -  you can check out any time you like, but you can never leave.  Which is why he is doomed to a lifetime in Hotel California. And he believes that, if this is as bad as it gets, life is really pretty good.    "

Now we can assess it for readability.

library(koRpus)
## Warning: package 'koRpus' was built under R version 3.2.5
## 
## Attaching package: 'koRpus'
## The following object is masked from 'package:lsa':
## 
##     query
## The following object is masked from 'package:dplyr':
## 
##     query
## The following object is masked from 'package:qdap':
## 
##     SMOG
write(text,file="textvec.txt")
text_tokens = tokenize("textvec.txt",lang="en")
#print(text_tokens)
print(c("Number of sentences: ",text_tokens@desc$sentences))
## [1] "Number of sentences: " "24"
print(c("Number of words: ",text_tokens@desc$words))
## [1] "Number of words: " "446"
print(c("Number of words per sentence: ",text_tokens@desc$avg.sentc.length))
## [1] "Number of words per sentence: " "18.5833333333333"
print(c("Average length of words: ",text_tokens@desc$avg.word.length))
## [1] "Average length of words: " "4.67488789237668"

Next we generate several indices of readability, which are worth looking at.

print(readability(text_tokens))
## Hyphenation (language: en)
## 
  |==================================================               |  77%
  |                                                                       
  |==================================================               |  78%
  |                                                                       
  |===================================================              |  78%
  |                                                                       
  |===================================================              |  79%
  |                                                                       
  |====================================================             |  79%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |====================================================             |  81%
  |                                                                       
  |=====================================================            |  81%
  |                                                                       
  |=====================================================            |  82%
  |                                                                       
  |======================================================           |  83%
  |                                                                       
  |======================================================           |  84%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=======================================================          |  85%
  |                                                                       
  |========================================================         |  85%
  |                                                                       
  |========================================================         |  86%
  |                                                                       
  |========================================================         |  87%
  |                                                                       
  |=========================================================        |  87%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |==========================================================       |  89%
  |                                                                       
  |==========================================================       |  90%
  |                                                                       
  |===========================================================      |  90%
  |                                                                       
  |===========================================================      |  91%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |============================================================     |  93%
  |                                                                       
  |=============================================================    |  93%
  |                                                                       
  |=============================================================    |  94%
  |                                                                       
  |==============================================================   |  95%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |===============================================================  |  96%
  |                                                                       
  |===============================================================  |  97%
  |                                                                       
  |===============================================================  |  98%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |================================================================ |  99%
  |                                                                       
  |=================================================================|  99%
  |                                                                       
  |=================================================================| 100%
## Warning: Bormuth: Missing word list, hence not calculated.
## Warning: Coleman: POS tags are not elaborate enough, can't count pronouns
## and prepositions. Formulae skipped.
## Warning: Dale-Chall: Missing word list, hence not calculated.
## Warning: DRP: Missing Bormuth Mean Cloze, hence not calculated.
## Warning: Harris.Jacobson: Missing word list, hence not calculated.
## Warning: Spache: Missing word list, hence not calculated.
## Warning: Traenkle.Bailer: POS tags are not elaborate enough, can't count
## prepositions and conjuctions. Formulae skipped.
## Warning: Note: The implementations of these formulas are still subject to validation:
##   Coleman, Danielson.Bryan, Dickes.Steiwer, ELF, Fucks, Harris.Jacobson, nWS, Strain, Traenkle.Bailer, TRI
##   Use the results with caution, even if they seem plausible!
## 
## Automated Readability Index (ARI)
##   Parameters: default 
##        Grade: 9.88 
## 
## 
## Coleman-Liau
##   Parameters: default 
##          ECP: 47% (estimted cloze percentage)
##        Grade: 10.09 
##        Grade: 10.1 (short formula)
## 
## 
## Danielson-Bryan
##   Parameters: default 
##          DB1: 7.64 
##          DB2: 48.58 
##        Grade: 9-12 
## 
## 
## Dickes-Steiwer's Handformel
##   Parameters: default 
##          TTR: 0.58 
##        Score: 42.76 
## 
## 
## Easy Listening Formula
##   Parameters: default 
##       Exsyls: 149 
##        Score: 6.21 
## 
## 
## Farr-Jenkins-Paterson
##   Parameters: default 
##           RE: 56.1 
##        Grade: >= 10 (high school) 
## 
## 
## Flesch Reading Ease
##   Parameters: en (Flesch) 
##           RE: 59.75 
##        Grade: >= 10 (high school) 
## 
## 
## Flesch-Kincaid Grade Level
##   Parameters: default 
##        Grade: 9.54 
##          Age: 14.54 
## 
## 
## Gunning Frequency of Gobbledygook (FOG)
##   Parameters: default 
##        Grade: 12.55 
## 
## 
## FORCAST
##   Parameters: default 
##        Grade: 10.01 
##          Age: 15.01 
## 
## 
## Fucks' Stilcharakteristik
##        Score: 86.88 
##        Grade: 9.32 
## 
## 
## Linsear Write
##   Parameters: default 
##   Easy words: 87 
##   Hard words: 13 
##        Grade: 11.71 
## 
## 
## Läsbarhetsindex (LIX)
##   Parameters: default 
##        Index: 40.56 
##       Rating: standard 
##        Grade: 6 
## 
## 
## Neue Wiener Sachtextformeln
##   Parameters: default 
##        nWS 1: 5.42 
##        nWS 2: 5.97 
##        nWS 3: 6.28 
##        nWS 4: 6.81 
## 
## 
## Readability Index (RIX)
##   Parameters: default 
##        Index: 4.08 
##        Grade: 9 
## 
## 
## Simple Measure of Gobbledygook (SMOG)
##   Parameters: default 
##        Grade: 12.01 
##          Age: 17.01 
## 
## 
## Strain Index
##   Parameters: default 
##        Index: 8.45 
## 
## 
## Kuntzsch's Text-Redundanz-Index
##   Parameters: default 
##  Short words: 297 
##  Punctuation: 71 
##      Foreign: 0 
##        Score: -56.22 
## 
## 
## Tuldava's Text Difficulty Formula
##   Parameters: default 
##        Index: 4.43 
## 
## 
## Wheeler-Smith
##   Parameters: default 
##        Score: 62.08 
##        Grade: > 4 
## 
## Text language: en

Text Summarization

It is easy to write a simple extractive summarizer in a few lines of code. The function below takes a character array in which each element is one sentence of the document we want summarized, and returns the sentences that overlap most with the rest of the document.

In the function we need to calculate how similar each sentence is to every other sentence. This could be done using cosine similarity, but here we use another approach, Jaccard similarity. Given two sentences, the Jaccard similarity is the size of the intersection of their word sets divided by the size of the union of their word sets.

Jaccard Similarity

A document \(D\) is comprised of \(m\) sentences \(s_i, i=1,2,...,m\), where each \(s_i\) is a set of words. We compute the pairwise overlap between sentences using the Jaccard similarity index:

\[ J_{ij} = J(s_i, s_j) = \frac{|s_i \cap s_j|}{|s_i \cup s_j|} = J_{ji} \]

The overlap is the size of the intersection of the two word sets in sentences \(s_i\) and \(s_j\), divided by the size of the union of the two sets. The similarity score of each sentence is the corresponding row sum of the Jaccard similarity matrix.

\[ {\cal S}_i = \sum_{j=1}^m J_{ij} \]
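As a small worked illustration of the two formulas above, the sketch below builds the Jaccard matrix and its row sums for three toy sentences. The sentences and the helper name `jaccard_index` are made up for this example and are not part of the talk's code.

```r
# Toy sentences (hypothetical, for illustration only)
sentences = c("big data is big",
              "data scientists use big data",
              "salaries are high")

# Represent each sentence as a set of unique words
word_sets = lapply(strsplit(sentences, " ", fixed = TRUE), unique)

# Jaccard index of two word sets: |intersection| / |union|
jaccard_index = function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

m = length(word_sets)
J = matrix(0, m, m)
for (i in 1:m) {
  for (j in 1:m) {
    J[i, j] = jaccard_index(word_sets[[i]], word_sets[[j]])
  }
}

S = rowSums(J)  # similarity score of each sentence
print(round(J, 3))
print(round(S, 3))
```

Sentences 1 and 2 share the words "big" and "data" out of five distinct words, so \(J_{12} = 2/5\), while sentence 3 shares nothing with the others. The diagonal entries \(J_{ii} = 1\) add the same constant to every row sum, so they do not change the ranking.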

Generating the summary

Once the row sums are obtained, they are sorted in decreasing order, and the summary consists of the \(n\) sentences with the highest \({\cal S}_i\) values.

# FUNCTION TO RETURN n SENTENCE SUMMARY
# Input: text, a character array with one sentence per element; n, summary length
# Output: the n sentences with the highest total Jaccard similarity to the rest
text_summary = function(text, n) {
  m = length(text)  # Number of sentences in input
  words = strsplit(text, " ")  # Split each sentence into words once
  jaccard = matrix(0,m,m)  # Pairwise Jaccard similarity matrix
  for (i in 1:m) {
    for (j in i:m) {  # Upper triangle only; the matrix is symmetric
      aa = words[[i]]; bb = words[[j]]
      jaccard[i,j] = length(intersect(aa,bb))/
                          length(union(aa,bb))
      jaccard[j,i] = jaccard[i,j]
    }
  }
  similarity_score = rowSums(jaccard)
  res = sort(similarity_score, index.return=TRUE,
          decreasing=TRUE)
  idx = res$ix[1:n]  # Indices of the n highest-scoring sentences
  return(text[idx])  # Sentences in order of decreasing score
}
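Note that `text_summary` returns the selected sentences ranked by score, not in the order in which they appear in the document. If a summary that reads in document order is preferred, the selection step could be written as in the sketch below. The function name `top_n_in_order` and the scores are hypothetical, chosen only to illustrate the idea.

```r
# Hypothetical sentences and similarity scores (illustration only)
sentences = c("s1", "s2", "s3", "s4", "s5")
scores = c(1.8, 3.1, 2.4, 3.6, 1.2)

top_n_in_order = function(sentences, scores, n) {
  n = min(n, length(sentences))                        # guard against n > number of sentences
  idx = order(scores, decreasing = TRUE)[seq_len(n)]   # indices of the n highest scores
  sentences[sort(idx)]                                 # restore original document order
}

print(top_n_in_order(sentences, scores, 3))
```

Here the three highest scores belong to sentences 4, 2, and 3, which are returned as "s2", "s3", "s4". The `min()` guard also avoids the `NA` entries that indexing with `1:n` would produce when `n` exceeds the number of sentences.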

Example: Summarization

We will use a sample of text that I took from Bloomberg news. It is about the need for data scientists.

url = "dstext_sample.txt"   #You can put any text file or URL here
text = read_web_page(url,cstem=0,cstop=0,ccase=0,cpunc=0,cflat=1)
print(length(text[[1]]))
## [1] 1
print("ORIGINAL TEXT")
## [1] "ORIGINAL TEXT"
print(text)
## [1] "THERE HAVE BEEN murmurings that we are now in the “trough of disillusionment” of big data, the hype around it having surpassed the reality of what it can deliver.  Gartner suggested that the “gravitational pull of big data is now so strong that even people who haven’t a clue as to what it’s all about report that they’re running big data projects.”  Indeed, their research with business decision makers suggests that organisations are struggling to get value from big data. Data scientists were meant to be the answer to this issue. Indeed, Hal Varian, Chief Economist at Google famously joked that “The sexy job in the next 10 years will be statisticians.” He was clearly right as we are now used to hearing that data scientists are the key to unlocking the value of big data. This has created a huge market for people with these skills. US recruitment agency, Glassdoor, report that the average salary for a data scientist is $118,709 versus $64,537 for a skilled programmer. And a McKinsey study predicts that by 2018, the United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and a 1.5 million shortage of managers with the skills to understand and make decisions based on analysis of big data.  It’s no wonder that companies are keen to employ data scientists when, for example, a retailer using big data can reportedly increase their margin by more than 60%.  However, is it really this simple? Can data scientists actually justify earning their salaries when brands seem to be struggling to realize the promise of big data? Perhaps we are expecting too much of data scientists. May be we are investing too much in a relatively small number of individuals rather than thinking about how we can design organisations to help us get the most from data assets. 
The focus on the data scientist often implies a centralized approach to analytics and decision making; we implicitly assume that a small team of highly skilled individuals can meet the needs of the organisation as a whole. This theme of centralized vs. decentralized decision-making is one that has long been debated in the management literature.  For many organisations a centralized structure helps maintain control over a vast international operation, plus ensures consistency of customer experience. Others, meanwhile, may give managers at a local level decision-making power particularly when it comes to tactical needs.   But the issue urgently needs revisiting in the context of big data as the way in which organisations manage themselves around data may well be a key factor for brands in realizing the value of their data assets. Economist and philosopher Friedrich Hayek took the view that organisations should consider the purpose of the information itself. Centralized decision-making can be more cost-effective and co-ordinated, he believed, but decentralization can add speed and local information that proves more valuable, even if the bigger picture is less clear.  He argued that organisations thought too highly of centralized knowledge, while ignoring ‘knowledge of the particular circumstances of time and place’. But it is only relatively recently that economists are starting to accumulate data that allows them to gauge how successful organisations organize themselves. One such exercise reported by Tim Harford was carried out by Harvard Professor Julie Wulf and the former chief economist of the International Monetary Fund, Raghuram Rajan. They reviewed the workings of large US organisations over fifteen years from the mid-80s. What they found was successful companies were often associated with a move towards decentralisation, often driven by globalisation and the need to react promptly to a diverse and swiftly-moving range of markets, particularly at a local level. 
Their research indicated that decentralisation pays. And technological advancement often goes hand-in-hand with decentralization. Data analytics is starting to filter down to the department layer, where executives are increasingly eager to trawl through the mass of information on offer. Cloud computing, meanwhile, means that line managers no longer rely on IT teams to deploy computer resources. They can do it themselves, in just minutes.  The decentralization trend is now impacting on technology spending. According to Gartner, chief marketing officers have been given the same purchasing power in this area as IT managers and, as their spending rises, so that of data centre managers is falling. Tim Harford makes a strong case for the way in which this decentralization is important given that the environment in which we operate is so unpredictable. Innovation typically comes, he argues from a “swirling mix of ideas not from isolated minds.” And he cites Jane Jacobs, writer on urban planning– who suggested we find innovation in cities rather than on the Pacific islands. But this approach is not necessarily always adopted. For example, research by academics Donald Marchand and Joe Peppard discovered that there was still a tendency for brands to approach big data projects the same way they would existing IT projects: i.e. using centralized IT specialists with a focus on building and deploying technology on time, to plan, and within budget. The problem with a centralized ‘IT-style’ approach is that it ignores the human side of the process of considering how people create and use information i.e. how do people actually deliver value from data assets. Marchand and Peppard suggest (among other recommendations) that those who need to be able to create meaning from data should be at the heart of any initiative. As ever then, the real value from data comes from asking the right questions of the data. 
And the right questions to ask only emerge if you are close enough to the business to see them. Are data scientists earning their salary? In my view they are a necessary but not sufficient part of the solution; brands need to be making greater investment in working with a greater range of users to help them ask questions of the data. Which probably means that data scientists’ salaries will need to take a hit in the process."
text2 = strsplit(text,". ",fixed=TRUE)  #Special handling of the period.
text2 = text2[[1]]
print("SENTENCES")
## [1] "SENTENCES"
print(text2)
##  [1] "THERE HAVE BEEN murmurings that we are now in the “trough of disillusionment” of big data, the hype around it having surpassed the reality of what it can deliver"                                                                                                                                                     
##  [2] " Gartner suggested that the “gravitational pull of big data is now so strong that even people who haven’t a clue as to what it’s all about report that they’re running big data projects.”  Indeed, their research with business decision makers suggests that organisations are struggling to get value from big data"
##  [3] "Data scientists were meant to be the answer to this issue"                                                                                                                                                                                                                                                             
##  [4] "Indeed, Hal Varian, Chief Economist at Google famously joked that “The sexy job in the next 10 years will be statisticians.” He was clearly right as we are now used to hearing that data scientists are the key to unlocking the value of big data"                                                                   
##  [5] "This has created a huge market for people with these skills"                                                                                                                                                                                                                                                           
##  [6] "US recruitment agency, Glassdoor, report that the average salary for a data scientist is $118,709 versus $64,537 for a skilled programmer"                                                                                                                                                                             
##  [7] "And a McKinsey study predicts that by 2018, the United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and a 1.5 million shortage of managers with the skills to understand and make decisions based on analysis of big data"                                                     
##  [8] " It’s no wonder that companies are keen to employ data scientists when, for example, a retailer using big data can reportedly increase their margin by more than 60%"                                                                                                                                                  
##  [9] " However, is it really this simple? Can data scientists actually justify earning their salaries when brands seem to be struggling to realize the promise of big data? Perhaps we are expecting too much of data scientists"                                                                                            
## [10] "May be we are investing too much in a relatively small number of individuals rather than thinking about how we can design organisations to help us get the most from data assets"                                                                                                                                      
## [11] "The focus on the data scientist often implies a centralized approach to analytics and decision making; we implicitly assume that a small team of highly skilled individuals can meet the needs of the organisation as a whole"                                                                                         
## [12] "This theme of centralized vs"                                                                                                                                                                                                                                                                                          
## [13] "decentralized decision-making is one that has long been debated in the management literature"                                                                                                                                                                                                                          
## [14] " For many organisations a centralized structure helps maintain control over a vast international operation, plus ensures consistency of customer experience"                                                                                                                                                           
## [15] "Others, meanwhile, may give managers at a local level decision-making power particularly when it comes to tactical needs"                                                                                                                                                                                              
## [16] "  But the issue urgently needs revisiting in the context of big data as the way in which organisations manage themselves around data may well be a key factor for brands in realizing the value of their data assets"                                                                                                  
## [17] "Economist and philosopher Friedrich Hayek took the view that organisations should consider the purpose of the information itself"                                                                                                                                                                                      
## [18] "Centralized decision-making can be more cost-effective and co-ordinated, he believed, but decentralization can add speed and local information that proves more valuable, even if the bigger picture is less clear"                                                                                                    
## [19] " He argued that organisations thought too highly of centralized knowledge, while ignoring ‘knowledge of the particular circumstances of time and place’"                                                                                                                                                               
## [20] "But it is only relatively recently that economists are starting to accumulate data that allows them to gauge how successful organisations organize themselves"                                                                                                                                                         
## [21] "One such exercise reported by Tim Harford was carried out by Harvard Professor Julie Wulf and the former chief economist of the International Monetary Fund, Raghuram Rajan"                                                                                                                                           
## [22] "They reviewed the workings of large US organisations over fifteen years from the mid-80s"                                                                                                                                                                                                                              
## [23] "What they found was successful companies were often associated with a move towards decentralisation, often driven by globalisation and the need to react promptly to a diverse and swiftly-moving range of markets, particularly at a local level"                                                                     
## [24] "Their research indicated that decentralisation pays"                                                                                                                                                                                                                                                                   
## [25] "And technological advancement often goes hand-in-hand with decentralization"                                                                                                                                                                                                                                           
## [26] "Data analytics is starting to filter down to the department layer, where executives are increasingly eager to trawl through the mass of information on offer"                                                                                                                                                          
## [27] "Cloud computing, meanwhile, means that line managers no longer rely on IT teams to deploy computer resources"                                                                                                                                                                                                          
## [28] "They can do it themselves, in just minutes"                                                                                                                                                                                                                                                                            
## [29] " The decentralization trend is now impacting on technology spending"                                                                                                                                                                                                                                                   
## [30] "According to Gartner, chief marketing officers have been given the same purchasing power in this area as IT managers and, as their spending rises, so that of data centre managers is falling"                                                                                                                         
## [31] "Tim Harford makes a strong case for the way in which this decentralization is important given that the environment in which we operate is so unpredictable"                                                                                                                                                            
## [32] "Innovation typically comes, he argues from a “swirling mix of ideas not from isolated minds.” And he cites Jane Jacobs, writer on urban planning– who suggested we find innovation in cities rather than on the Pacific islands"                                                                                       
## [33] "But this approach is not necessarily always adopted"                                                                                                                                                                                                                                                                   
## [34] "For example, research by academics Donald Marchand and Joe Peppard discovered that there was still a tendency for brands to approach big data projects the same way they would existing IT projects: i.e"                                                                                                              
## [35] "using centralized IT specialists with a focus on building and deploying technology on time, to plan, and within budget"                                                                                                                                                                                                
## [36] "The problem with a centralized ‘IT-style’ approach is that it ignores the human side of the process of considering how people create and use information i.e"                                                                                                                                                          
## [37] "how do people actually deliver value from data assets"                                                                                                                                                                                                                                                                 
## [38] "Marchand and Peppard suggest (among other recommendations) that those who need to be able to create meaning from data should be at the heart of any initiative"                                                                                                                                                        
## [39] "As ever then, the real value from data comes from asking the right questions of the data"                                                                                                                                                                                                                              
## [40] "And the right questions to ask only emerge if you are close enough to the business to see them"                                                                                                                                                                                                                        
## [41] "Are data scientists earning their salary? In my view they are a necessary but not sufficient part of the solution; brands need to be making greater investment in working with a greater range of users to help them ask questions of the data"                                                                        
## [42] "Which probably means that data scientists’ salaries will need to take a hit in the process."
print("SUMMARY")
## [1] "SUMMARY"
res = text_summary(text2,5)
print(res)
## [1] " Gartner suggested that the “gravitational pull of big data is now so strong that even people who haven’t a clue as to what it’s all about report that they’re running big data projects.”  Indeed, their research with business decision makers suggests that organisations are struggling to get value from big data"
## [2] "The focus on the data scientist often implies a centralized approach to analytics and decision making; we implicitly assume that a small team of highly skilled individuals can meet the needs of the organisation as a whole"                                                                                         
## [3] "May be we are investing too much in a relatively small number of individuals rather than thinking about how we can design organisations to help us get the most from data assets"                                                                                                                                      
## [4] "The problem with a centralized ‘IT-style’ approach is that it ignores the human side of the process of considering how people create and use information i.e"                                                                                                                                                          
## [5] "Which probably means that data scientists’ salaries will need to take a hit in the process."

Text Mining Research in Finance

In this segment we survey text mining research in the field of finance.

  1. Lu, Chen, Chen, Hung, and Li (2010) categorize finance related textual content into three categories: (a) forums, blogs, and wikis; (b) news and research reports; and (c) content generated by firms.

  2. Extracting sentiment and other information from messages posted to stock message boards such as Yahoo!, Motley Fool, Silicon Investor, Raging Bull, etc., see Tumarkin and Whitelaw (2001), Antweiler and Frank (2004), Antweiler and Frank (2005), Das, Martinez-Jerez and Tufano (2005), Das and Chen (2007).

  3. Other news sources: Lexis-Nexis, Factiva, Dow Jones News, etc., see Das, Martinez-Jerez and Tufano (2005); Boudoukh, Feldman, Kogan, Richardson (2012).

  4. The Heard on the Street column in the Wall Street Journal has been used in work by Tetlock (2007), Tetlock, Saar-Tsechansky and Macskassy (2008); see also the use of Wall Street Journal articles by Lu, Chen, Chen, Hung, and Li (2010).

  5. Thomson-Reuters NewsScope Sentiment Engine (RNSE) based on Infonics/Lexalytics algorithms and varied data on stocks and text from internal databases, see Leinweber and Sisk (2011). Zhang and Skiena (2010) develop a market neutral trading strategy using news media such as tweets, over 500 newspapers, Spinn3r RSS feeds, and LiveJournal.

Das and Chen (Management Science 2007)

Using Twitter and Facebook for Market Prediction

  1. Bollen, Mao, and Zeng (2010) claimed that stock direction of the Dow Jones Industrial Average can be predicted using tweets with 87.6% accuracy.

  2. Bar-Haim, Dinur, Feldman, Fresko and Goldstein (2011) attempt to predict stock direction using tweets by detecting and overweighting the opinion of expert investors.

  3. Brown (2012) looks at the correlation between tweets and the stock market via several measures.

  4. Logunov (2011) uses OpinionFinder to generate many measures of sentiment from tweets.

  5. Twitter based sentiment developed by Rao and Srivastava (2012) is found to be highly correlated with stock prices and indexes, as high as 0.88 for returns.

  6. Sprenger and Welpe (2010) find that tweet bullishness is associated with abnormal stock returns and tweet volume predicts trading volume.

Polarity and Subjectivity

Zhang and Skiena (2010) use Twitter feeds and also three other sources of text: over 500 nationwide newspapers, RSS feeds from blogs, and LiveJournal blogs. These are used to compute two metrics.

\[ \begin{eqnarray*} \mbox{polarity} &=& \frac{n_{pos} - n_{neg}}{n_{pos} + n_{neg}} \\ \mbox{subjectivity} &=& \frac{n_{pos} + n_{neg}}{N} \end{eqnarray*} \]

where \(N\) is the total number of words in a text document, \(n_{pos}, n_{neg}\) are the number of positive and negative words, respectively.
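As a minimal sketch of the two metrics, the code below counts positive and negative words in a single sentence. The word lists and the example sentence are hypothetical and stand in for a full sentiment lexicon and a real document; the function name polarity_subjectivity is ours, not from Zhang and Skiena (2010).

```r
# Hypothetical word lists; a real application would use a full sentiment lexicon
pos_words = c("gain", "strong", "beat", "upgrade")
neg_words = c("loss", "weak", "miss", "downgrade")

polarity_subjectivity = function(doc, pos_words, neg_words) {
  # Lowercase the text and split on anything that is not a letter
  words = unlist(strsplit(tolower(doc), "[^a-z]+"))
  words = words[words != ""]
  N = length(words)                    # total number of words
  n_pos = sum(words %in% pos_words)    # number of positive words
  n_neg = sum(words %in% neg_words)    # number of negative words
  c(polarity = (n_pos - n_neg) / (n_pos + n_neg),
    subjectivity = (n_pos + n_neg) / N)
}

doc = "Strong quarter with earnings beat but guidance was weak"
res_ps = polarity_subjectivity(doc, pos_words, neg_words)
print(res_ps)
```

Note that polarity is undefined (NaN in R) when a document contains no sentiment words at all, so such documents need to be handled separately.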

Logunov (2011) applies OpinionFinder to tweet data and also develops a new classifier, Naive Emoticon Classification, to encode sentiment. This is an unusual and original, albeit quite intuitive, use of emoticons to determine mood in text mining. If an emoticon exists, the tweet is automatically coded with that sentiment or emotion. Four types of emoticons are considered: Happy (H), Sad (S), Joy (J), and Cry (C). Polarity is defined here as \[ \mbox{polarity} = A = \frac{n_H + n_J}{n_H + n_S + n_J + n_C} \] Values greater than 0.5 are positive. \(A\) stands for aggregate sentiment and appears to be strongly autocorrelated. Overall, prediction evidence is weak.
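The emoticon-based polarity can be sketched in a few lines. The mapping from emoticon strings to the four classes and the sample tweets below are assumptions made for illustration; they are not taken from Logunov (2011).

```r
# Hypothetical emoticon-to-class mapping (an assumption for illustration)
emoticons = c(H = ":)", S = ":(", J = ":D", C = ":'(")

# Hypothetical tweets
tweets = c("Great earnings :)", "Rally continues :D", "Stock tanked :(",
           "Up again :)", "No emoticon here")

# Count tweets containing each emoticon (fixed = TRUE matches literally, not as a regex)
n = sapply(emoticons, function(e) sum(grepl(e, tweets, fixed = TRUE)))

# Aggregate sentiment: share of Happy and Joy among all emoticon-bearing tweets
A = unname((n["H"] + n["J"]) / sum(n))
print(n)
print(A)
```

Tweets without any emoticon do not enter the counts, so \(A\) is computed only over the emoticon-bearing subset.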

Commercial Products

Stock Twits

iSentium

RavenPack

Text Mining Corporate Reports

There is a proliferation of word-weighting schemes. One idea is to use “inverse document frequency” (\(idf\)) as a weighting coefficient. The \(idf\) for word \(j\) would be

\[ w_j^{idf} = \ln \left( \frac{N}{df_j} \right) \] where \(N\) is the total number of documents, and \(df_j\) is the number of documents containing word \(j\). This scheme is described in Manning and Schutze (1999).
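To make the weighting concrete, here is a small sketch that computes \(idf\) on a toy corpus. The four documents are hypothetical, and the variable names (docs, doc_terms, vocab, df, idf) are ours.

```r
# Toy corpus: each element is one document (hypothetical text)
docs = c("risk loss litigation", "growth profit", "risk growth",
         "loss risk restructuring")
N = length(docs)

# Unique terms appearing in each document
doc_terms = lapply(strsplit(docs, " "), unique)
vocab = sort(unique(unlist(doc_terms)))

# Document frequency: number of documents that contain each term
df = sapply(vocab, function(w) sum(sapply(doc_terms, function(d) w %in% d)))

# Inverse document frequency: rare terms get larger weights
idf = log(N / df)
print(round(idf, 4))
```

A word that appears in every document gets a weight of zero, while a word that appears in only one document gets the maximum weight \(\ln N\).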

Tone

Using the MD&A

Readability of Financial Reports

IBM’s Midas System

Corporate Finance and Risk Management

  1. Sprenger (2011) integrates data from text classification of tweets, user voting, and a proprietary stock game to extract the bullishness of online investors; these ideas are behind the site http://TweetTrader.net.

  2. Tweets also pose interesting problems of big streaming data discussed in Pervin, Fang, Datta, and Dutta (2013).

  3. Data used here is from filings such as 10-Ks; see Loughran and McDonald (2011); Burdick et al (2011); Bodnaruk, Loughran, and McDonald (2013); Jegadeesh and Wu (2013); Loughran and McDonald (2014).

Predicting Markets

  1. Wysocki (1999) found that for the 50 top firms in message posting volume on Yahoo! Finance, message volume predicted next day abnormal stock returns. Using a broader set of firms, he also found that high message volume firms were those with inflated valuations (relative to fundamentals), high trading volume, high short seller activity (given possibly inflated valuations), high analyst following (message posting appears to be related to news as well, correlated with a general notion of “attention” stocks), and low institutional holdings (hence broader investor discussion and interest), all intuitive outcomes.

  2. Bagnoli, Beneish, and Watts (1999) examined earnings “whispers”, unofficial crowd-sourced forecasts of quarterly earnings from small investors, and found them to be more accurate than First Call analyst forecasts.

  3. Tumarkin and Whitelaw (2001) examined self-reported sentiment on the Raging Bull message board and found no predictive content, either of returns or volume.

Bullishness Index

Antweiler and Frank (2004) used the Naive Bayes algorithm for classification, implemented in the Rainbow package of Andrew McCallum (1996). They also repeated the same using Support Vector Machines (SVMs) as a robustness check. Both algorithms generate similar empirical results. Once the algorithm is trained, they use it out-of-sample to sign each message as \(\{Buy, Hold, Sell\}\). Let \(n_B, n_S\) be the number of buy and sell messages, respectively. Then \(R = n_B/n_S\) is just the ratio of buy to sell messages. Based on this they define their bullishness index

\[ B = \frac{n_B - n_S}{n_B + n_S} = \frac{R-1}{R+1} \in (-1,+1) \]

This metric is independent of the number of messages, i.e., it is homogeneous of degree zero in \(n_B,n_S\). An alternative measure is also proposed, i.e.,

\[ \begin{eqnarray*} B^* &=& \ln\left[\frac{1+n_B}{1+n_S} \right] \\ &=& \ln\left[\frac{1+R(1+n_B+n_S)}{1+R+n_B+n_S} \right] \\ &=& \ln\left[\frac{2+(n_B+n_S)(1+B)}{2+(n_B+n_S)(1-B)} \right] \\ & \approx & B \cdot \ln(1+n_B+n_S) \end{eqnarray*} \]

This measure takes the bullishness index \(B\) and weights it by the number of messages of both categories. It is homogeneous of degree between zero and one. They also propose a third measure, which is much more direct, i.e.,

\[ B^{**} = n_B - n_S = (n_B+n_S) \cdot \frac{R-1}{R+1} = M \cdot B \]

which is homogeneous of degree one, and is a message-weighted bullishness index. They prefer to use \(B^*\) in their algorithms as it appears to deliver the best predictive results. Finally, they produce an agreement index,

\[ A = 1 - \sqrt{1-B^2} \in (0,1) \]

Note how closely this is related to the disagreement index seen earlier.
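The measures above are easy to compute side by side. The sketch below uses made-up message counts and a helper named bullishness (our name) to show how each index responds when the number of messages is scaled up by a factor of ten while the buy/sell mix is held fixed.

```r
bullishness = function(n_B, n_S) {
  M = n_B + n_S                         # total number of buy and sell messages
  B = (n_B - n_S) / M                   # bullishness index, degree zero
  B_star = log((1 + n_B) / (1 + n_S))   # log-ratio measure
  B_2star = n_B - n_S                   # message-weighted index, degree one
  A = 1 - sqrt(1 - B^2)                 # agreement index
  c(B = B, B_star = B_star, B_2star = B_2star, A = A)
}

# Hypothetical counts: same buy/sell mix, ten times as many messages
x = bullishness(60, 40)
y = bullishness(600, 400)
print(round(rbind(x, y), 4))
```

In this example \(B\) and \(A\) are unchanged by the scaling, \(B^{**}\) scales one-for-one with message volume, and \(B^*\) moves by less than the scaling factor.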

Possible Applications for Finance Firms

An illustrative list of applications for finance firms is as follows:

What is LSA?

Latent Semantic Analysis (LSA) is an approach for reducing the dimension of the Term-Document Matrix (TDM), or the corresponding Document-Term Matrix (DTM); the two terms are generally used interchangeably unless a specific one is invoked. Dimension reduction of the TDM offers two benefits:

How is LSA implemented using SVD?

LSA is the application of Singular Value Decomposition (SVD) to the TDM, extracted from a text corpus. Define the TDM to be a matrix \(M \in {\cal R}^{m \times n}\), where \(m\) is the number of terms and \(n\) is the number of documents.

The SVD of matrix \(M\) is given by \[ M = T \cdot S \cdot D^\top \] where \(T \in {\cal R}^{m \times n}\) and \(D \in {\cal R}^{n \times n}\) have orthonormal columns, and \(S \in {\cal R}^{n \times n}\) is the “singular values” matrix, i.e., a diagonal matrix with singular values on the diagonal. These values denote the relative importance of the latent dimensions of the TDM.

Example

Create a temporary directory and add some documents to it. This is a modification of the example in the lsa package.

dir.create("D")
write( c("blue", "red", "green"), file=paste("D", "D1.txt", sep="/"))
write( c("black", "blue", "red"), file=paste("D", "D2.txt", sep="/"))
write( c("yellow", "black", "green"), file=paste("D", "D3.txt", sep="/"))
write( c("yellow", "red", "black"), file=paste("D", "D4.txt", sep="/"))

Create a TDM using the textmatrix function.

library(lsa)
tdm = textmatrix("D",minWordLength=1)
print(tdm)
##         docs
## terms    D1.txt D2.txt D3.txt D4.txt
##   blue        1      1      0      0
##   green       1      0      1      0
##   red         1      1      0      1
##   black       0      1      1      1
##   yellow      0      0      1      1

Remove the extra directory.

unlink("D", recursive=TRUE)

So, what does SVD do?

SVD connects the correlation matrix of terms (\(M \cdot M^\top\)) with the correlation matrix of documents (\(M^\top \cdot M\)) through the matrix of singular values.

To see this connection, note that matrix \(T\) contains the eigenvectors of the correlation matrix of terms. Likewise, the matrix \(D\) contains the eigenvectors of the correlation matrix of documents. To see this, let’s compute

et = eigen(tdm %*% t(tdm))$vectors
print(et)
##            [,1]          [,2]        [,3]          [,4]       [,5]
## [1,] -0.3629044 -6.015010e-01 -0.06829369  3.717480e-01  0.6030227
## [2,] -0.3328695 -2.220446e-16 -0.89347008  5.551115e-16 -0.3015113
## [3,] -0.5593741 -3.717480e-01  0.31014767 -6.015010e-01 -0.3015113
## [4,] -0.5593741  3.717480e-01  0.31014767  6.015010e-01 -0.3015113
## [5,] -0.3629044  6.015010e-01 -0.06829369 -3.717480e-01  0.6030227
ed = eigen(t(tdm) %*% tdm)$vectors
print(ed)
##            [,1]      [,2]       [,3]      [,4]
## [1,] -0.4570561  0.601501 -0.5395366 -0.371748
## [2,] -0.5395366  0.371748  0.4570561  0.601501
## [3,] -0.4570561 -0.601501 -0.5395366  0.371748
## [4,] -0.5395366 -0.371748  0.4570561 -0.601501

Dimension reduction of the TDM via LSA

If we wish to reduce the dimension of the latent semantic space to \(k < n\) then we use only the first \(k\) eigenvectors. The lsa function does this automatically.

We call LSA and ask it to automatically reduce the dimension of the TDM using a built-in function dimcalc_share.

res = lsa(tdm,dims=dimcalc_share())
print(res)
## $tk
##              [,1]          [,2]
## blue   -0.3629044 -6.015010e-01
## green  -0.3328695 -5.551115e-17
## red    -0.5593741 -3.717480e-01
## black  -0.5593741  3.717480e-01
## yellow -0.3629044  6.015010e-01
## 
## $dk
##              [,1]      [,2]
## D1.txt -0.4570561 -0.601501
## D2.txt -0.5395366 -0.371748
## D3.txt -0.4570561  0.601501
## D4.txt -0.5395366  0.371748
## 
## $sk
## [1] 2.746158 1.618034
## 
## attr(,"class")
## [1] "LSAspace"

We can see that the dimension has been reduced from \(n=4\) to \(k=2\). The output is shown for both the term matrix and the document matrix, both of which have only two columns. Think of these as the two “principal semantic components” of the TDM.
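Why two dimensions? As a rough sketch, and assuming that dimcalc_share uses its default share of 0.5 and keeps just enough leading singular values for their cumulative share of the total to reach that level, the choice can be reproduced by hand. The singular values are copied from the example; the selection rule written here is our reading of the function, not its source code.

```r
# Singular values of the example TDM
sv = c(2.746158, 1.618034, 1.207733, 0.618034)

# Cumulative share of the total accounted for by the leading singular values
cum_share = cumsum(sv) / sum(sv)
print(round(cum_share, 4))

# Assumed rule: keep the first k values whose cumulative share reaches 0.5
share = 0.5
k = which(cum_share >= share)[1]
print(k)
```

The first singular value alone accounts for less than half of the total, so a second one is needed, which agrees with the two columns returned by lsa.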

Compare the output of the LSA to the eigenvectors above: the columns of tk and dk are the first two eigenvectors, up to sign. The singular values in the output are connected to SVD as follows.

LSA and SVD: the connection?

First of all we see that the lsa function is nothing but the svd function in base R.

res2 = svd(tdm)
print(res2)
## $d
## [1] 2.746158 1.618034 1.207733 0.618034
## 
## $u
##            [,1]          [,2]        [,3]          [,4]
## [1,] -0.3629044 -6.015010e-01  0.06829369  3.717480e-01
## [2,] -0.3328695 -5.551115e-17  0.89347008 -3.455569e-15
## [3,] -0.5593741 -3.717480e-01 -0.31014767 -6.015010e-01
## [4,] -0.5593741  3.717480e-01 -0.31014767  6.015010e-01
## [5,] -0.3629044  6.015010e-01  0.06829369 -3.717480e-01
## 
## $v
##            [,1]      [,2]       [,3]      [,4]
## [1,] -0.4570561 -0.601501  0.5395366 -0.371748
## [2,] -0.5395366 -0.371748 -0.4570561  0.601501
## [3,] -0.4570561  0.601501  0.5395366  0.371748
## [4,] -0.5395366  0.371748 -0.4570561 -0.601501

The output here is the same as that of LSA except it is provided for \(n=4\). So we have four columns in \(T\) and \(D\) rather than two. Compare the results here to the previous two slides to see the connection.
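The connection can also be checked numerically with base R alone. In this sketch the TDM is typed in by hand (so the block runs without the lsa package), and the names M, s, ev, and M_k are ours. It verifies that the squared singular values are the eigenvalues of \(M^\top \cdot M\), and builds the rank-2 approximation from the two largest singular values.

```r
# The example TDM, entered by hand
M = matrix(c(1, 1, 0, 0,
             1, 0, 1, 0,
             1, 1, 0, 1,
             0, 1, 1, 1,
             0, 0, 1, 1), nrow = 5, byrow = TRUE,
           dimnames = list(c("blue", "green", "red", "black", "yellow"),
                           paste0("D", 1:4, ".txt")))

s = svd(M)
ev = eigen(t(M) %*% M)$values

# Squared singular values equal the eigenvalues of M'M
print(cbind(sv_squared = s$d^2, eigenvalues = ev))

# Rank-2 approximation using only the two largest singular values
k = 2
M_k = s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
dimnames(M_k) = dimnames(M)
print(round(M_k, 4))
```

Because each term in the reconstruction multiplies a column of \(T\) by the matching column of \(D\), the sign ambiguity of individual eigenvectors cancels out in M_k.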

What is the rank of the TDM?

We may reconstruct the TDM using the result of the LSA.

tdm_lsa = res$tk %*% diag(res$sk) %*% t(res$dk)
print(tdm_lsa)
##            D1.txt    D2.txt     D3.txt    D4.txt
## blue    1.0409089 0.8995016 -0.1299115 0.1758948
## green   0.4178005 0.4931970  0.4178005 0.4931970
## red     1.0639006 1.0524048  0.3402938 0.6051912
## black   0.3402938 0.6051912  1.0639006 1.0524048
## yellow -0.1299115 0.1758948  1.0409089 0.8995016

We see the new TDM after the LSA operation. It has non-integer frequency counts, but it may be treated in the same way as the original TDM. The document vectors populate a slightly different hyperspace.

LSA reduces the rank of the TDM, and hence of the correlation matrix of terms \(M \cdot M^\top\), to \(k=2\). Here we see the rank before and after LSA.

library(Matrix)
print(rankMatrix(tdm))
## [1] 4
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 1.110223e-15
print(rankMatrix(tdm_lsa))
## [1] 2
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 1.110223e-15

And LDA, what does it have to do with LSA?

LDA is similar to LSA in that it seeks to find the most related words and cluster them into topics. It uses a Bayesian approach to do this, but more on that later. Here, let’s just do an example to see how we might use the topicmodels package.

#Load the package
library(topicmodels)

#Load data on news articles from Associated Press
data(AssociatedPress)
print(dim(AssociatedPress))
## [1]  2246 10473

This is a large DTM (not TDM), with more than 10,000 terms and more than 2,000 documents. LDA will take some time on a matrix of this size, so let’s run it on a subset of the documents.

dtm = AssociatedPress[1:100,]
dim(dtm)
## [1]   100 10473

Now we run LDA on this data set

#Set parameters for Gibbs sampling
burnin = 4000
iter = 2000
thin = 500
seed = list(2003,5,63,100001,765)
nstart = 5
best = TRUE

#Number of topics
k = 5
#Run LDA
res <-LDA(dtm, k, method="Gibbs", control = list(nstart = nstart, seed = seed, best = best, burnin = burnin, iter = iter, thin = thin))

#Show topics
res.topics = as.matrix(topics(res))
print(res.topics)
##        [,1]
##   [1,]    5
##   [2,]    4
##   [3,]    5
##   [4,]    1
##   [5,]    1
##   [6,]    4
##   [7,]    2
##   [8,]    1
##   [9,]    5
##  [10,]    5
##  [11,]    5
##  [12,]    3
##  [13,]    1
##  [14,]    4
##  [15,]    2
##  [16,]    3
##  [17,]    1
##  [18,]    1
##  [19,]    2
##  [20,]    3
##  [21,]    5
##  [22,]    2
##  [23,]    2
##  [24,]    1
##  [25,]    2
##  [26,]    4
##  [27,]    4
##  [28,]    2
##  [29,]    4
##  [30,]    3
##  [31,]    2
##  [32,]    1
##  [33,]    4
##  [34,]    1
##  [35,]    5
##  [36,]    4
##  [37,]    1
##  [38,]    4
##  [39,]    4
##  [40,]    2
##  [41,]    2
##  [42,]    2
##  [43,]    1
##  [44,]    1
##  [45,]    5
##  [46,]    3
##  [47,]    2
##  [48,]    3
##  [49,]    1
##  [50,]    4
##  [51,]    1
##  [52,]    2
##  [53,]    3
##  [54,]    1
##  [55,]    3
##  [56,]    4
##  [57,]    4
##  [58,]    2
##  [59,]    5
##  [60,]    2
##  [61,]    2
##  [62,]    3
##  [63,]    2
##  [64,]    1
##  [65,]    2
##  [66,]    4
##  [67,]    5
##  [68,]    2
##  [69,]    4
##  [70,]    5
##  [71,]    5
##  [72,]    5
##  [73,]    2
##  [74,]    5
##  [75,]    2
##  [76,]    1
##  [77,]    1
##  [78,]    1
##  [79,]    3
##  [80,]    5
##  [81,]    1
##  [82,]    3
##  [83,]    5
##  [84,]    3
##  [85,]    3
##  [86,]    5
##  [87,]    2
##  [88,]    5
##  [89,]    2
##  [90,]    5
##  [91,]    3
##  [92,]    1
##  [93,]    1
##  [94,]    4
##  [95,]    3
##  [96,]    4
##  [97,]    4
##  [98,]    4
##  [99,]    5
## [100,]    5
#Show top terms
res.terms = as.matrix(terms(res,10))
print(res.terms)
##       Topic 1          Topic 2   Topic 3      Topic 4      Topic 5   
##  [1,] "i"              "percent" "new"        "soviet"     "police"  
##  [2,] "people"         "year"    "york"       "government" "central" 
##  [3,] "state"          "company" "expected"   "official"   "man"     
##  [4,] "years"          "last"    "states"     "two"        "monday"  
##  [5,] "bush"           "new"     "officials"  "union"      "friday"  
##  [6,] "president"      "bank"    "program"    "officials"  "city"    
##  [7,] "get"            "oil"     "california" "war"        "four"    
##  [8,] "told"           "prices"  "week"       "president"  "school"  
##  [9,] "administration" "report"  "air"        "world"      "high"    
## [10,] "dukakis"        "million" "help"       "leaders"    "national"
#Show topic probabilities
res.topicProbs = as.data.frame(res@gamma)
print(res.topicProbs)
##             V1         V2         V3         V4         V5
## 1   0.19169329 0.06070288 0.04472843 0.10223642 0.60063898
## 2   0.12149533 0.14330218 0.08099688 0.58255452 0.07165109
## 3   0.27213115 0.04262295 0.05901639 0.07868852 0.54754098
## 4   0.29571984 0.16731518 0.19844358 0.19455253 0.14396887
## 5   0.31896552 0.15517241 0.20689655 0.14655172 0.17241379
## 6   0.30360934 0.08492569 0.08492569 0.46284501 0.06369427
## 7   0.17050691 0.40092166 0.15668203 0.17050691 0.10138249
## 8   0.37142857 0.15238095 0.14285714 0.20000000 0.13333333
## 9   0.19298246 0.17543860 0.19298246 0.19298246 0.24561404
## 10  0.19879518 0.16265060 0.17469880 0.18674699 0.27710843
## 11  0.21212121 0.20202020 0.16161616 0.15151515 0.27272727
## 12  0.20143885 0.15827338 0.25899281 0.17985612 0.20143885
## 13  0.41395349 0.16279070 0.18139535 0.12558140 0.11627907
## 14  0.17948718 0.17948718 0.12820513 0.30769231 0.20512821
## 15  0.05135952 0.78247734 0.06344411 0.06042296 0.04229607
## 16  0.09770115 0.24712644 0.35632184 0.14942529 0.14942529
## 17  0.43103448 0.18103448 0.09051724 0.10775862 0.18965517
## 18  0.67857143 0.04591837 0.06377551 0.08418367 0.12755102
## 19  0.07083333 0.70000000 0.08750000 0.07500000 0.06666667
## 20  0.15196078 0.05637255 0.69117647 0.04656863 0.05392157
## 21  0.21782178 0.11881188 0.12871287 0.15841584 0.37623762
## 22  0.16666667 0.30000000 0.16666667 0.16666667 0.20000000
## 23  0.19298246 0.21052632 0.17543860 0.21052632 0.21052632
## 24  0.31775701 0.20560748 0.16822430 0.18691589 0.12149533
## 25  0.05121951 0.65121951 0.15365854 0.08536585 0.05853659
## 26  0.11740891 0.09311741 0.08502024 0.37246964 0.33198381
## 27  0.06583072 0.05956113 0.10658307 0.68338558 0.08463950
## 28  0.15068493 0.30136986 0.12328767 0.26027397 0.16438356
## 29  0.07860262 0.04148472 0.05676856 0.68995633 0.13318777
## 30  0.13968254 0.17142857 0.46031746 0.07936508 0.14920635
## 31  0.08405172 0.74784483 0.07112069 0.05172414 0.04525862
## 32  0.66137566 0.10846561 0.06349206 0.07407407 0.09259259
## 33  0.14655172 0.18103448 0.15517241 0.41379310 0.10344828
## 34  0.29605263 0.19736842 0.21052632 0.13157895 0.16447368
## 35  0.08080808 0.05050505 0.10437710 0.07070707 0.69360269
## 36  0.13333333 0.07878788 0.08484848 0.46666667 0.23636364
## 37  0.46202532 0.08227848 0.12974684 0.16139241 0.16455696
## 38  0.09442060 0.07296137 0.12017167 0.64377682 0.06866953
## 39  0.11764706 0.08359133 0.10526316 0.62538700 0.06811146
## 40  0.10869565 0.56521739 0.14492754 0.07246377 0.10869565
## 41  0.07671958 0.43650794 0.16137566 0.25396825 0.07142857
## 42  0.11445783 0.57831325 0.11445783 0.09036145 0.10240964
## 43  0.55793991 0.10944206 0.08798283 0.09442060 0.15021459
## 44  0.40939597 0.10067114 0.22818792 0.12751678 0.13422819
## 45  0.20000000 0.15121951 0.12682927 0.25853659 0.26341463
## 46  0.14828897 0.11406844 0.56653992 0.08365019 0.08745247
## 47  0.09929078 0.41134752 0.13475177 0.22695035 0.12765957
## 48  0.20129870 0.07467532 0.54870130 0.10714286 0.06818182
## 49  0.46800000 0.09600000 0.18400000 0.10400000 0.14800000
## 50  0.22955145 0.08179420 0.05013193 0.60158311 0.03693931
## 51  0.28368794 0.17730496 0.18439716 0.14893617 0.20567376
## 52  0.12977099 0.45801527 0.12977099 0.18320611 0.09923664
## 53  0.10507246 0.14492754 0.55072464 0.06884058 0.13043478
## 54  0.42647059 0.13725490 0.15196078 0.15686275 0.12745098
## 55  0.11881188 0.19801980 0.44554455 0.08910891 0.14851485
## 56  0.22857143 0.15714286 0.13571429 0.37142857 0.10714286
## 57  0.15294118 0.07058824 0.06117647 0.66823529 0.04705882
## 58  0.11494253 0.49425287 0.14367816 0.12068966 0.12643678
## 59  0.13278008 0.04979253 0.13692946 0.26556017 0.41493776
## 60  0.16666667 0.31666667 0.16666667 0.16666667 0.18333333
## 61  0.06796117 0.73786408 0.08090615 0.04854369 0.06472492
## 62  0.12680115 0.12968300 0.58213256 0.12103746 0.04034582
## 63  0.07902736 0.72948328 0.09118541 0.05471125 0.04559271
## 64  0.44285714 0.12142857 0.14285714 0.13214286 0.16071429
## 65  0.19540230 0.31034483 0.19540230 0.14942529 0.14942529
## 66  0.18518519 0.22222222 0.17037037 0.28888889 0.13333333
## 67  0.07024793 0.07851240 0.08677686 0.04545455 0.71900826
## 68  0.10181818 0.48000000 0.14909091 0.12727273 0.14181818
## 69  0.12307692 0.15384615 0.10000000 0.43076923 0.19230769
## 70  0.12745098 0.07352941 0.14215686 0.13235294 0.52450980
## 71  0.21582734 0.10791367 0.16546763 0.14388489 0.36690647
## 72  0.17560976 0.11219512 0.17073171 0.15609756 0.38536585
## 73  0.12280702 0.46198830 0.07602339 0.23976608 0.09941520
## 74  0.20535714 0.16964286 0.17857143 0.14285714 0.30357143
## 75  0.07567568 0.47027027 0.11891892 0.19459459 0.14054054
## 76  0.67310789 0.15619968 0.07407407 0.05152979 0.04508857
## 77  0.63834423 0.07189542 0.09150327 0.11546841 0.08278867
## 78  0.61504425 0.09292035 0.11946903 0.11504425 0.05752212
## 79  0.10971787 0.07523511 0.65830721 0.07210031 0.08463950
## 80  0.11111111 0.08666667 0.11111111 0.05777778 0.63333333
## 81  0.49681529 0.03821656 0.15286624 0.14437367 0.16772824
## 82  0.20111732 0.17318436 0.24022346 0.15642458 0.22905028
## 83  0.10731707 0.15609756 0.11219512 0.23902439 0.38536585
## 84  0.26016260 0.10569106 0.36585366 0.13008130 0.13821138
## 85  0.11525424 0.10508475 0.39322034 0.30508475 0.08135593
## 86  0.15454545 0.06060606 0.15757576 0.09696970 0.53030303
## 87  0.08301887 0.67924528 0.07924528 0.09433962 0.06415094
## 88  0.16666667 0.15972222 0.22916667 0.11805556 0.32638889
## 89  0.12389381 0.47787611 0.09734513 0.14159292 0.15929204
## 90  0.12389381 0.11061947 0.23008850 0.10176991 0.43362832
## 91  0.19724771 0.11009174 0.30275229 0.16972477 0.22018349
## 92  0.33854167 0.13541667 0.12500000 0.11458333 0.28645833
## 93  0.40131579 0.13815789 0.10526316 0.18421053 0.17105263
## 94  0.06930693 0.10231023 0.09240924 0.67656766 0.05940594
## 95  0.09130435 0.15000000 0.65434783 0.03043478 0.07391304
## 96  0.13370474 0.13091922 0.12256267 0.49303621 0.11977716
## 97  0.06709265 0.06070288 0.11501597 0.60383387 0.15335463
## 98  0.16438356 0.16438356 0.17808219 0.28767123 0.20547945
## 99  0.06274510 0.08235294 0.16470588 0.06666667 0.62352941
## 100 0.11627907 0.20465116 0.11162791 0.16744186 0.40000000
#Check that the topic probabilities for each document sum to 1
print(rowSums(res.topicProbs))
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Note that the highest probability in each row assigns each document to a topic.
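This can be checked by hand on the first four rows of the probabilities printed above. The matrix gamma_head below is a typed-in copy of those rows, and the name is ours.

```r
# First four rows of the topic probabilities shown above
gamma_head = rbind(
  c(0.19169329, 0.06070288, 0.04472843, 0.10223642, 0.60063898),
  c(0.12149533, 0.14330218, 0.08099688, 0.58255452, 0.07165109),
  c(0.27213115, 0.04262295, 0.05901639, 0.07868852, 0.54754098),
  c(0.29571984, 0.16731518, 0.19844358, 0.19455253, 0.14396887))

# The assigned topic is the column holding the largest probability in each row
assigned = apply(gamma_head, 1, which.max)
print(assigned)
```

The result, topics 5, 4, 5, and 1, matches the first four entries returned by topics(res). Note that document 4 is assigned to topic 1 with a probability of only about 0.30, so the assignment is far less decisive there than for document 1.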

Shallow Dive into LDA

Latent Dirichlet Allocation (LDA) was created by David Blei, Andrew Ng, and Michael Jordan in 2003, see their paper titled “Latent Dirichlet Allocation” in the Journal of Machine Learning Research, pp 993–1022.

The simplest way to think about LDA is as a probability model that connects documents with words and topics. The components are:

Next, we connect the above objects to \(K\) topics, indexed by \(l\), i.e., \(t_l\). We will see that LDA is encapsulated in two matrices: Matrix \(A\) and Matrix \(B\).

Matrix \(A\): Connecting Documents with Topics

Matrix \(B\): Connecting Words with Topics

Distribution of Topics in a Document

\[ p(\theta | \alpha) = \frac{\Gamma(\sum_{l=1}^K \alpha_l)}{\prod_{l=1}^K \Gamma(\alpha_l)} \; \prod_{l=1}^K \theta_l^{\alpha_l - 1} \]

where \(\Gamma(\cdot)\) is the Gamma function.

  - LDA thus gets its name from the use of the Dirichlet distribution, embodied in Matrix \(A\). Since the topics are latent, it explains the rest of the nomenclature.

  - Given \(\theta\), we sample topics from matrix \(A\) with probability \(p(t | \theta)\).
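Base R has no Dirichlet sampler, but a draw of \(\theta\) can be sketched by normalizing independent Gamma variates, which is a standard construction. The helper name rdirichlet1, the seed, and the choice \(\alpha_l = 0.5\) are assumptions for illustration.

```r
# One draw from Dirichlet(alpha): normalize independent Gamma(alpha_l, 1) variates
rdirichlet1 = function(alpha) {
  g = rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

set.seed(42)            # arbitrary seed, for reproducibility only
K = 5                   # number of topics
alpha = rep(0.5, K)     # hypothetical symmetric Dirichlet parameters
theta = rdirichlet1(alpha)
print(round(theta, 4))
print(sum(theta))
```

Each draw is a vector of \(K\) topic weights that sums to one, just like a row of the topic probabilities in the earlier example. Small values of \(\alpha_l\) tend to produce documents dominated by a few topics.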

Distribution of Words and Topics for a Document

\[ p(\theta, {\bf t}, {\bf w}) = p(\theta | \alpha) \prod_{l=1}^K p(t_l | \theta) p(w_l | t_l) \]

\[ p({\bf w}) = \int p(\theta | \alpha) \left(\prod_{l=1}^K \sum_{t_l} p(t_l | \theta) p(w_l | t_l)\; \right) d\theta \]

Likelihood of the entire Corpus

\[ p(D) = \prod_{j=1}^M \int p(\theta_j | \alpha) \left(\prod_{l=1}^K \sum_{t_{jl}} p(t_{jl} | \theta_j) p(w_{jl} | t_{jl})\; \right) d\theta_j \]
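One way to read these formulas is as a recipe for generating a corpus: draw \(\theta\) for each document, draw a topic for each word position given \(\theta\), then draw a word given the topic. The sketch below simulates this. The vocabulary, the word-topic matrix B, the document length, and the helper generate_doc are all hypothetical.

```r
set.seed(1)             # arbitrary seed, for reproducibility only
K = 3                   # number of topics
vocab = c("bank", "oil", "police", "soviet", "percent", "school")

# Hypothetical Matrix B: each row is one topic's distribution over words
B = rbind(c(0.40, 0.10, 0.05, 0.05, 0.35, 0.05),
          c(0.05, 0.05, 0.40, 0.05, 0.05, 0.40),
          c(0.05, 0.40, 0.05, 0.40, 0.05, 0.05))
colnames(B) = vocab

generate_doc = function(n_words, alpha, B) {
  g = rgamma(length(alpha), shape = alpha)
  theta = g / sum(g)                         # topic mix for this document
  # Topic of each word position, drawn with probabilities theta
  tpc = sample(nrow(B), n_words, replace = TRUE, prob = theta)
  # Word at each position, drawn from the row of B for its topic
  w = sapply(tpc, function(l) sample(colnames(B), 1, prob = B[l, ]))
  list(theta = theta, topics = tpc, words = w)
}

# A small corpus of four documents with eight words each
corpus = lapply(1:4, function(j) generate_doc(8, rep(0.5, K), B))
print(corpus[[1]])
```

Fitting LDA runs this story in reverse: given only the words, it infers the topic mix of each document and the word distribution of each topic.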

Examples in Finance

Using the rvest package: Overview

The rvest package, written by Hadley Wickham, is a powerful tool for extracting text from web pages. The package provides wrappers around the ‘xml2’ and ‘httr’ packages to make it easy to download, and then manipulate, HTML and XML. The package is best illustrated with some simple examples.

Program to read a web page using the selector gadget

Here is some code to read in the Slashdot web page and gather the stories currently in its headlines.

library(rvest)
url = "https://slashdot.org/"

doc.html = read_html(url)
text = doc.html %>% html_nodes(".story") %>% html_text()

text = gsub("[\t\n]","",text)
#text = paste(text, collapse=" ")
print(text[1:20])
##  [1] " Leak Shows PlayStation 4 Neo Is Expected To Have Twice The Graphics Horsepower  (hothardware.com) 18"
##  [2] " Russia Is Building a Nuclear Space Bomber  (thedailybeast.com) 111"                                  
##  [3] " Do You Have A Living Doppelgänger?  (bbc.com) 87"                                                    
##  [4] " Google Decided To Nix Its Oculus Rift Competitor  (recode.net) 36"                                   
##  [5] " Elon Musk: Autopilot Feature Was Disabled In Pennsylvania Crash  (latimes.com) 106"                  
##  [6] " Newt Gingrich Says Visiting An ISIS Or Al Qaeda Website Should Be A Felony  (techdirt.com) 271"      
##  [7] " Slashdot Asks: Would You Eat Lab-Grown Meat?  (dmarge.com) 226"                                      
##  [8] " Cybercrime Overtakes Traditional Crime In UK, Says Report  (krebsonsecurity.com) 22"                 
##  [9] " Fake Pokemon Go App On Google Play Infects Phones With Screenlocker  (arstechnica.com) 41"           
## [10] " Facebook, Twitter, and YouTube Blocked In Turkey During Reported Coup Attempt  (techcrunch.com) 141" 
## [11] " Samsung In Talks With BYD To Buy Stake In Electric-Car Maker  (bloomberg.com) 12"                    
## [12] " Facebook Makes Little Progress in Race and Gender Diversity  (theguardian.com) 167"                  
## [13] " 'Tor and Bitcoin Hinder Anti-Piracy Efforts'  (torrentfreak.com) 90"                                 
## [14] " White House Pledges $400M To Back Speedier 5G Wireless Networks  (fortune.com) 82"                   
## [15] " Comcast Expands $10 Low-Income Internet Plan  (arstechnica.com) 58"                                  
## [16] NA                                                                                                     
## [17] NA                                                                                                     
## [18] NA                                                                                                     
## [19] NA                                                                                                     
## [20] NA

Program to read a web table using the selector gadget

Sometimes we need to read a table embedded in a web page. This is also a simple exercise with rvest.

library(rvest)
url = "http://finance.yahoo.com/q?uhb=uhb2&fr=uh3_finance_vert_gs&type=2button&s=IBM"

doc.html = read_html(url)
table = doc.html %>% html_nodes("table") %>% html_table()

print(table)
## [[1]]
##                                                                X1     X2
## 1 Show all results for Tip: Use comma to separate multiple quotes Search
## 
## [[2]]
##           X1           X2
## 1 Prev Close       160.28
## 2       Open       159.90
## 3        Bid 159.13 x 200
## 4        Ask 159.44 x 500
## 
## [[3]]
##             X1              X2
## 1   52wk Range 116.90 - 173.78
## 2  Day's Range 158.50 - 159.93
## 3       Volume       4,337,764
## 4 Avg Vol (3m)       3,781,471
## 
## [[4]]
##                 X1           X2
## 1       Market Cap      153.38B
## 2  P/E Ratio (ttm)        12.11
## 3      Diluted EPS          N/A
## 4             Beta         0.78
## 5    Earnings Date          N/A
## 6 Dividend & Yield 5.60 (3.49%)
## 7 Ex-Dividend Date          N/A

Note that this code extracted all the web tables in the Yahoo! Finance page and returned each one as a list item.
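Scraped tables arrive as text, so the values usually need cleaning before they can be used as numbers. The sketch below works on a hand-entered copy of the third table in the output above (the live page may no longer return the same layout); in practice tbl would be the corresponding list item. The names tbl, quote_vals, volume, and rng are ours.

```r
# Hand-entered copy of the third table in the output above
tbl = data.frame(X1 = c("52wk Range", "Day's Range", "Volume", "Avg Vol (3m)"),
                 X2 = c("116.90 - 173.78", "158.50 - 159.93",
                        "4,337,764", "3,781,471"),
                 stringsAsFactors = FALSE)

# Turn the two-column table into a named lookup
quote_vals = setNames(tbl$X2, tbl$X1)

# Strip thousands separators before converting to numeric
volume = as.numeric(gsub(",", "", quote_vals[["Volume"]]))
print(volume)

# Split a range such as "116.90 - 173.78" into its two end points
rng = as.numeric(strsplit(quote_vals[["52wk Range"]], " - ", fixed = TRUE)[[1]])
print(rng)
```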

Program to read a web table into a data frame

Here we extract forex quotes from some Russian-language pages and store them in a data frame.

library(rvest)

url1 <- "http://finance.i.ua/market/kiev/?type=1"  #Buy USD
url2 <- "http://finance.i.ua/market/kiev/?type=2"  #Sell USD

doc1.html = read_html(url1)
table1 = doc1.html %>% html_nodes("table") %>% html_table()
result1 = table1[[1]]
print(head(result1))
##   Время    Курс         Сумма          Телефон
## 1 15:50 27.3500       10000 € +38 093 Показать
## 2 15:49 24.7500       10000 $ +38 093 Показать
## 3 15:44 24.7300        5000 $ +38 098 Показать
## 4 15:42  0.3845      700000 ₽ +38 067 Показать
## 5 15:40  0.3820      200000 ₽ +38 093 Показать
## 6 15:36 27.4000        5000 € +38 050 Показать
##                                                Район
## 1                                              Подол
## 2                                              Подол
## 3 Нивки, Святошино, Берестейская, Победы просп, Борщ
## 4                                              любой
## 5                                              Подол
## 6 Виноградарь-Щербаков а Оболонь-Нивки можу під'їхат
##                                      Комментарий
## 1                   Можно частями обменный пункт
## 2                   Можно частями обменный пункт
## 3                  могу подъехать, можно частями
## 4                          можно частями подьеду
## 5                   Можно частями обменный пункт
## 6 можу під'їхати от 1000. Части от 100 привозите

# Repeat the same steps for the sell-side quotes
doc2.html = read_html(url2)
table2 = doc2.html %>% html_nodes("table") %>% html_table()
result2 = table2[[1]]
print(head(result2))
##   Время   Курс        Сумма          Телефон
## 1 15:50 24.780       5000 $ +38 098 Показать
## 2 15:50 24.800      10000 $ +38 093 Показать
## 3 15:50 24.780       6000 $ +38 093 Показать
## 4 15:45 24.800       5000 $ +38 099 Показать
## 5 15:43 24.770      25000 $ +38 096 Показать
## 6 15:41  0.389      50000 ₽ +38 093 Показать
##                                                Район
## 1                                    О с о к о р к и
## 2                                              Подол
## 3                    О_с_о_к_о_р_к_и. П_о_з_н_я_к_и.
## 4 Нивки, Победы пр, Берестейская, Святошино, Борщаго
## 5                          Осокорки. Обменный пункт.
## 6                                              Подол
##                     Комментарий
## 1                 можно частями
## 2  Можно частями обменный пункт
## 3                 можно частями
## 4 могу подъехать, можно частями
## 5          можно частями от 5т.
## 6  Можно частями обменный пункт
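
Once the quotes are in data frames, they can be stacked and cleaned with base R. The sketch below is one way to do this, under stated assumptions: buy and sell are hypothetical stand-ins for result1 and result2, built by hand from the first three rows of each table above, with column names translated to English for readability. It labels each quote by side, splits the amount column into a numeric size and a currency symbol, and averages the USD rates on each side.

```r
# Hypothetical stand-ins for result1 and result2 (first three rows of each
# table above; column names translated, other columns omitted)
buy <- data.frame(Time   = c("15:50", "15:49", "15:44"),
                  Rate   = c(27.35, 24.75, 24.73),
                  Amount = c("10000 €", "10000 $", "5000 $"),
                  stringsAsFactors = FALSE)
sell <- data.frame(Time   = c("15:50", "15:50", "15:50"),
                   Rate   = c(24.78, 24.80, 24.78),
                   Amount = c("5000 $", "10000 $", "6000 $"),
                   stringsAsFactors = FALSE)

# Tag each quote with its side and stack the two tables
buy$Side  <- "buy"
sell$Side <- "sell"
quotes <- rbind(buy, sell)

# Split strings such as "10000 $" into a numeric size and a currency symbol
parts <- strsplit(quotes$Amount, " ", fixed = TRUE)
quotes$Size     <- as.numeric(sapply(parts, `[`, 1))
quotes$Currency <- sapply(parts, `[`, 2)

# Average USD rate on each side of the market
usd <- quotes[quotes$Currency == "$", ]
avg_rate <- tapply(usd$Rate, usd$Side, mean)
print(avg_rate)
```

With the live tables, the same steps would apply to the Курс (rate) and Сумма (amount) columns of result1 and result2.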

Using the RSelenium package

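# Clicking the "Show More" button on a Google Scholar page

library(RCurl)
library(RSelenium)
library(rvest)
library(stringr)
library(igraph)

# Download (if needed) and start a local Selenium server.
# Note: recent RSelenium releases mark checkForServer() and startServer()
# as defunct; there, rsDriver() or a separately started server is used instead.
checkForServer()
startServer()

# Connect to the server and open a Firefox session
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4444,
                      browserName = "firefox")
remDr$open()
remDr$getStatus()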

Application to Google Scholar data

# Search Google Scholar for the author ("\uE007" is the Enter key)
remDr$navigate("http://scholar.google.com")
webElem <- remDr$findElement(using = 'css selector', "input#gs_hp_tsi")
webElem$sendKeysToElement(list("Sanjiv Das", "\uE007"))

# Read the results page and pull out the link to the author's profile
link <- webElem$getCurrentUrl()
page <- read_html(as.character(link))
citations <- page %>% html_nodes(".gs_rt2")
matched <- str_match_all(as.character(citations), "<a href=\"(.*?)\"")
scholarurl <- paste("https://scholar.google.com", matched[[1]][,2], sep="")

# Open the profile page and select the fields under the CSS selector .gs_gray
page <- read_html(as.character(scholarurl))
remDr$navigate(as.character(scholarurl))
authorlist <- page %>% html_nodes(css=".gs_gray") %>% html_text()
authorlist <- as.data.frame(authorlist)

# Odd rows go into one column and even rows into the other, forming a table
odd_index <- seq(1, nrow(authorlist), 2)
even_index <- seq(2, nrow(authorlist), 2)
authornames <- data.frame(x=authorlist[odd_index,1])
papernames <- data.frame(x=authorlist[even_index,1])
pubmatrix <- cbind(authornames, papernames)

# Building the "view all" link on the Scholar profile page.
# The 12 characters after "user=" are taken as the author's Scholar ID.
a <- str_split(scholarurl, "user=")
x <- substring(a[[1]][2], 1, 12)
y <- paste("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=", x, sep="")
remDr$navigate(y)

# Reading the "view all" page to get the co-author list
page <- read_html(as.character(y))
z <- page %>% html_nodes(".gsc_1usr_name")

# Keep the name text between ">" and "<", then strip the two delimiters
x <- lapply(as.character(z), str_extract, ">[A-Z]+[a-z]+ .+<")
x <- lapply(x, str_replace, ">", "")
x <- lapply(x, str_replace, "<", "")

# Graph function: a star network linking SR Das to each co-author
bsk <- as.matrix(cbind("SR Das", unlist(x)))
bsk.network <- graph.data.frame(bsk, directed=FALSE)
plot(bsk.network)
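
The script above takes the first 12 characters after user= as the Scholar ID and then forms a star network around one author. A minimal base R sketch of those two steps follows. It pulls the ID with a regular expression rather than a fixed width, which is an alternative to the approach above and not the original method. The profile URL, the ID inside it, and the co-author names are made-up placeholders, not real data.

```r
# Hypothetical profile URL; the user ID here is a made-up placeholder
profile_url <- "https://scholar.google.com/citations?user=ABCDEFGHIJKL&hl=en"

# Capture everything after "user=" up to the next "&" (or the end of the string)
user_id <- sub(".*[?&]user=([^&]+).*", "\\1", profile_url)

# Same "view all colleagues" link as in the script above
colleagues_url <- paste0(
  "https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=",
  user_id)

# Placeholder co-author names standing in for the scraped list x
coauthors <- c("Author One", "Author Two", "Author Three")

# Two-column edge list: the focal author linked to each co-author
edges <- cbind(from = "SR Das", to = coauthors)
print(edges)
```

An edge list of this shape can be converted with as.data.frame() and passed to igraph's graph_from_data_frame() with directed = FALSE to obtain the same kind of plot.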

word2vec

See the text2vec package, which fits GloVe word embeddings, a close relative of word2vec.

End Note!

Biblio at: http://srdas.github.io/Das_TextAnalyticsInFinance.pdf